From 7b98b8b9b6507a1ef23541a6f896f7c7c3ebd6cd Mon Sep 17 00:00:00 2001
From: suraj-vathsa <85908731+suraj-vathsa@users.noreply.github.com>
Date: Fri, 15 Dec 2023 11:40:32 -0800
Subject: [PATCH] Suraj/update triton main (#1)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Changed copyright (#5705)
* Modify timeout test in L0_sequence_batcher to use portable backend (#5696)
* Modify timeout test in L0_sequence_batcher to use portable backend
* Use identity backend that is built by default on Windows
* updated upstream container name (#5713)
* Fix triton container version (#5714)
* Update the L0_model_config test expected error message (#5684)
* Use better value in timeout test L0_sequence_batcher (#5716)
* Use better value in timeout test L0_sequence_batcher
* Format
* Update JAX install (#5613)
* Add notes about socket usage to L0_client_memory_growth test (#5710)
* Check TensorRT error message more granularly (#5719)
* Check TRT err msg more granularly
* Clarify source of error messages
* Consolidate tests for message parts
* Pin Python Package Versions for HTML Document Generation (#5727)
* updating with pinned versions for python dependencies
* updated with pinned sphinx and nbclient versions
* Test full error returned when custom batcher init fails (#5729)
* Add testing for batcher init failure, add wait for status check
* Formatting
* Change search string
* Add fastertransformer test (#5500)
Add fastertransformer test that uses 1GPU.
* Fix L0_backend_python on Jetson (#5728)
* Don't use mem probe in Jetson
* Clarify failure messages in L0_backend_python
* Update copyright
* Add JIRA ref, fix _test_jetson
* Add testing for Python custom metrics API (#5669)
* Add testing for python custom metrics API
* Add custom metrics example to the test
* Fix for CodeQL report
* Fix test name
* Address comment
* Add logger and change the enum usage
* Add testing for Triton Client Plugin API (#5706)
* Add HTTP client plugin test
* Add testing for HTTP asyncio
* Add async plugin support
* Fix qa container for L0_grpc
* Add testing for grpc client plugin
* Remove unused imports
* Fix up
* Fix L0_grpc models QA folder
* Update the test based on review feedback
* Remove unused import
* Add testing for .plugin method
* Install jemalloc (#5738)
* Add --metrics-address and testing (#5737)
* Add --metrics-address, add tests to L0_socket, re-order CLI options for consistency
* Use non-localhost address
* Add testing for basic auth plugin for HTTP/gRPC clients (#5739)
* Add HTTP basic auth test
* Add testing for gRPC basic auth
* Fix up
* Remove unused imports
* Add multi-gpu, multi-stream testing for dlpack tensors (#5550)
* Add multi-gpu, multi-stream testing for dlpack tensors
* Update note on SageMaker MME support for ensemble (#5723)
* Run L0_backend_python subtests with virtual environment (#5753)
* Update 'main' to track development of 2.35.0 / r23.06 (#5764)
* Include jemalloc into the documentation (#5760)
* Enhance tests in L0_model_update (#5724)
* Add model instance name update test
* Add gap for timestamp to update
* Add some tests with dynamic batching
* Extend supported test on rate limit off
* Continue test if off mode failed
* Fix L0_memory_growth (#5795) (1) reduce MAX_ALLOWED_ALLOC to be more strict for bounded tests, and generous for unbounded tests. (2) allow unstable measurement from PA.
(3) improve logging for future triage * Add note on --metrics-address (#5800) * Add note on --metrics-address * Copyright * Minor fix for running "mlflow deployments create -t triton --flavor triton ..." (#5658) UnboundLocalError: local variable 'meta_dict' referenced before assignment The above error shows in listing models in Triton model repository * Adding test for new sequence mode (#5771) * Adding test for new sequence mode * Update option name * Clean up testing spacing and new lines * MLFlow Triton Plugin: Add support for s3 prefix and custom endpoint URL (#5686) * MLFlow Triton Plugin: Add support for s3 prefix and custom endpoint URL Signed-off-by: Xiaodong Ye * Update the function order of config.py and use os.path.join to replace filtering a list of strings then joining Signed-off-by: Xiaodong Ye * Update onnx flavor to support s3 prefix and custom endpoint URL Signed-off-by: Xiaodong Ye * Fix two typos in MLFlow Triton plugin README.md Signed-off-by: Xiaodong Ye * Address review comments (replace => strip) Signed-off-by: Xiaodong Ye * Address review comments (init regex only for s3) Signed-off-by: Xiaodong Ye * Remove unused local variable: slash_locations Signed-off-by: Xiaodong Ye --------- Signed-off-by: Xiaodong Ye * Fix client script (#5806) * Add MLFlow test for already loaded models. Update copyright year (#5808) * Use the correct gtest filter (#5824) * Add error message test on S3 access decline (#5825) * Add test on access decline * Fix typo * Add MinIO S3 access decline test * Make sure bucket exists during access decline test * Restore AWS_SECRET_ACCESS_KEY on S3 local test (#5832) * Restore AWS_SECRET_ACCESS_KEY * Add reason for restoring keys * nnshah1 stream infer segfault fix (#5842) match logic from infer_handler.cc * Remove unused test (#5851) * Add and document memory usage in statistic protocol (#5642) * Add and document memory usage in statistic protocol * Fix doc * Fix up * [DO NOT MERGE Add test. 
FIXME: model generation * Fix up * Fix style * Address comment * Fix up * Set memory tracker backend option in build.py * Fix up * Add CUPTI library in Windows image build * Add note to build with memory tracker by default * use correct lib dir on CentOS (#5836) * use correct lib dir on CentOS * use new location for opentelemetry-cpp * Document that gpu-base flag is optional for cpu-only builds (#5861) * Update Jetson tests in Docker container (#5734) * Add flags for ORT build * Separate list with commas * Remove unnecessary detection of nvcc compiler * Fixed Jetson path for perf_client, datadir * Create version directoryy for custom model * Remove probe check for shm, add shm exceed error for Jetson * Copyright updates, fix Jetson Probe * Fix be_python test num on Jetson * Remove extra comma, non-Dockerized Jetson comment * Remove comment about Jetson being non-dockerized * Remove no longer needed flag * Update `main` post-23.05 release (#5880) * Update README and versions for 23.05 branch * Changes to support 23.05 (#5782) * Update python and conda version * Update CMAKE installation * Update checksum version * Update ubuntu base image to 22.04 * Use ORT 1.15.0 * Set CMAKE to pull latest version * Update libre package version * Removing unused argument * Adding condition for ubuntu 22.04 * Removing installation of the package from the devel container * Nnshah1 u22.04 (#5770) * Update CMAKE installation * Update python and conda version * Update CMAKE installation * Update checksum version * Update ubuntu base image to 22.04 * updating versions for ubuntu 22.04 * remove re2 --------- Co-authored-by: Neelay Shah Co-authored-by: Neelay Shah * Set ONNX version to 1.13.0 * Fix L0_custom_ops for ubuntu 22.04 (#5775) * add back rapidjson-dev --------- Co-authored-by: Neelay Shah Co-authored-by: Neelay Shah Co-authored-by: nv-kmcgill53 <101670481+nv-kmcgill53@users.noreply.github.com> * Fix L0_mlflow (#5805) * working thread * remove default install of blinker * merge issue fixed * Fix L0_backend_python/env test (#5799) * Fix L0_backend_python/env test * Address comment * Update the copyright * Fix up * Fix L0_http_fuzz (#5776) * installing python 3.8.16 for test * spelling Co-authored-by: Neelay Shah * use util functions to install python3.8 in an easier way --------- Co-authored-by: Neelay Shah * Update Windows versions for 23.05 release (#5826) * Rename Ubuntu 20.04 mentions to 22.04 (#5849) * Update DCGM version (#5856) * Update DCGM version (#5857) * downgrade DCGM version to 2.4.7 (#5860) * Updating link for latest release notes to 23.05 --------- Co-authored-by: Neelay Shah Co-authored-by: Neelay Shah Co-authored-by: nv-kmcgill53 <101670481+nv-kmcgill53@users.noreply.github.com> Co-authored-by: Iman Tabrizian * Disable memory tracker on Jetpack until the library is available (#5882) * Fix datadir for x86 (#5894) * Add more test on instance signature (#5852) * Add testing for new error handling API (#5892) * Test batch input for libtorch (#5855) * Draft ragged TensorRT unit model gen * Draft libtorch special identity model * Autoformat * Update test, fix ragged model gen * Update suffix for io for libtorch * Remove unused variables * Fix io names for libtorch * Use INPUT0/OUTPUT0 for libtorch * Reorder to match test model configs * Remove unnecessary capitalization * Auto-format * Capitalization is necessary * Remove unnecessary export * Clean up Azure dependency in server build (#5900) * [DO NOT MERGE] * Remove Azure dependency in server component build * Finalize * Fix dependency * 
Fixing up * Clean up * Add response parameters for streaming GRPC inference to enhance decoupled support (#5878) * Update 'main' to track development of 2.36.0 / 23.07 (#5917) * Add test for detecting S3 http2 upgrade request (#5911) * Add test for detecting S3 http2 upgrade request * Enhance testing * Copyright year update * Add Redis cache build, tests, and docs (#5916) * Updated handling for uint64 request priority * Ensure HPCX dependencies found in container (#5922) * Add HPCX dependencies to search path * Copy hpcx to CPU-only container * Add ucc path to CPU-only image * Fixed if statement * Fix df variable * Combine hpcx LD_LIBRARY_PATH * Add test case where MetricFamily is deleted before deleting Metric (#5915) * Add test case for metric lifetime error handling * Address comment * Use different MetricFamily name * Add testing for Pytorch instance group kind MODEL (#5810) * Add testing for Pytorch instance group kind MODEL * Remove unused item * Update testing to verify the infer result * Add copyright * Remove unused import * Update pip install * Update the model to use the same add sub logic * Add torch multi-gpu and multi-device models to L0_io * Fix up model version * Add test for sending instance update config via load API (#5937) * Add test for passing config via load api * Add more docs on instance update behavior * Update to suggested docs Co-authored-by: Ryan McCormick * Use dictionary for json config * Modify the config fetched from Triton instead --------- Co-authored-by: Ryan McCormick * Fix L0_batcher count check (#5939) * Add testing for json tensor format (#5914) * Add redis config and use local logfile for redis server (#5945) * Add redis config and use local logfile for redis server * Move redis log config to CLI * Have separate redis logs for unit tests and CLI tests * Add test on rate limiter max resource decrease update (#5885) * Add test on rate limiter max resource decrease update * Add test with explicit resource * Check server log for decreased resource limit * Add docs on decoupled final response feature (#5936) * Allow changing ping behavior based on env variable in SageMaker and entrypoint updates (#5910) * Allow changing ping behavior based on env variable in SageMaker * Add option for additional args * Make ping further configurable * Allow further configuration of grpc and http ports * Update docker/sagemaker/serve * Update docker/sagemaker/serve --------- Co-authored-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com> * Remove only MPI libraries in HPCX in L0_perf_analyzer (#5967) * Be more specific with MPI removal * Delete all libmpi libs * Ensure L0_batch_input requests received in order (#5963) * Add print statements for debugging * Add debugging print statements * Test using grpc client with stream to fix race * Use streaming client in all non-batch tests * Switch all clients to streaming GRPC * Remove unused imports, vars * Address comments * Remove random comment * Set inputs as separate function * Split set inputs based on test type * Add test for redis cache auth credentials via env vars (#5966) * Auto-formatting (#5979) * Auto-format * Change to clang-format-15 in CONTRIBTUING * Adding tests ensuring locale setting is passed to python backend interpreter * Refactor build.py CPU-only Linux libs for readability (#5990) * Improve the error message when the number of GPUs is insufficient (#5993) * Update README to include CPP-API Java Bindings (#5883) * Update env variable to use for overriding /ping behavior (#5994) * Add test that >1000 
model files can be loaded in S3 (#5976)
* Add test for >1000 files
* Capitalization for consistency
* Add bucket cleaning at end
* Move test pass/fail to end
* Check number of files in model dir at load time
* Add testing for GPU tensor error handling (#5871)
* Add testing for GPU tensor error handling
* Fix up
* Remove exit 0
* Fix jetson
* Fix up
* Add test for Python BLS model loading API (#5980)
* Add test for Python BLS model loading API
* Fix up
* Update README and versions for 23.06 branch
* Fix LD_LIBRARY_PATH for PyTorch backend
* Return updated df in add_cpu_libs
* Remove unneeded df param
* Update test failure messages to match Dataloader changes (#6006)
* Add dependency for L0_python_client_unit_tests (#6010)
* Improve performance tuning guide (#6026)
* Enabling nested spans for trace mode OpenTelemetry (#5928)
* Adding nested spans to OTel tracing + support of ensemble models
* Move multi-GPU dlpack test to a separate L0 test (#6001)
* Move multi-GPU dlpack test to a separate L0 test
* Fix copyright
* Fix up
* OpenVINO 2023.0.0 (#6031)
* Upgrade OV to 2023.0.0
* Upgrade OV model gen script to 2023.0.0
* Add test to check the output memory type for onnx models (#6033)
* Add test to check the output memory type for onnx models
* Remove unused import
* Address comment
* Add testing for implicit state for PyTorch backend (#6016)
* Add testing for implicit state for PyTorch backend
* Add testing for libtorch string implicit models
* Fix CodeQL
* Mention that libtorch backend supports implicit state
* Fix CodeQL
* Review edits
* Fix output tests for PyTorch backend
* Allow uncompressed conda execution environments (#6005)
Add test for uncompressed conda execution environments
* Fix implicit state test (#6039)
* Adding target_compile_features cxx_std_17 to tracing lib (#6040)
* Update 'main' to track development of 2.37.0 / 23.08
* Fix intermittent failure in L0_model_namespacing (#6052)
* Fix PyTorch implicit model mounting in gen_qa_model_repository (#6054)
* Fix broken links pointing to the `grpc_server.cc` file (#6068)
* Fix L0_backend_python expected instance name (#6073)
* Fix expected instance name
* Copyright year
* Fix L0_sdk: update the search name for the client wheel (#6074)
* Fix name of client wheel to be looked for
* Fix up
* Add GitHub action to format and lint code (#6022)
* Add pre-commit
* Fix typos, exec/shebang, formatting
* Remove clang-format
* Update contributing md to include pre-commit
* Update spacing in CONTRIBUTING
* Fix contributing pre-commit link
* Link to pre-commit install directions
* Wording
* Restore clang-format
* Fix yaml spacing
* Exclude templates folder for check-yaml
* Remove unused vars
* Normalize spacing
* Remove unused variable
* Normalize config indentation
* Update .clang-format to enforce max line length of 80
* Update copyrights
* Update copyrights
* Run workflows on every PR
* Fix copyright year
* Fix grammar
* Entrypoint.d files are not executable
* Run pre-commit hooks
* Mark not executable
* Run pre-commit hooks
* Remove unused variable
* Run pre-commit hooks after rebase
* Update copyrights
* Fix README.md typo (decoupled)
Co-authored-by: Ryan McCormick
* Run pre-commit hooks
* Grammar fix
Co-authored-by: Ryan McCormick
* Redundant word
Co-authored-by: Ryan McCormick
* Revert docker file changes
* Executable shebang revert
* Make model.py files non-executable
* Passin is proper flag
* Run pre-commit hooks on init_args/model.py
* Fix typo in init_args/model.py
* Make copyrights one line
---------
Co-authored-by: Ryan McCormick
* Fix default instance name change when count is 1 (#6088)
* Add test for sequence model instance update (#5831)
* Add test for sequence model instance update
* Add gap for file timestamp update
* Update test for non-blocking sequence update
* Update documentation
* Remove mentioning increase instance count case
* Add more documentation for scheduler update test
* Update test for non-blocking batcher removal
* Add polling due to async scheduler destruction
* Use _ as private
* Fix typo
* Add docs on instance count decrease
* Fix typo
* Separate direct and oldest to different test cases
* Separate nested tests in a loop into multiple test cases
* Refactor scheduler update test
* Improve doc on handling future test failures
* Address pre-commit
* Add best effort to reset model state after a single test case failure
* Remove reset model method to make harder for chaining multiple test cases as one
* Remove description on model state clean up
* Fix default instance name (#6097)
* Removing unused tests (#6085)
* Update post-23.07 release (#6103)
* Update README and versions for 2.36.0 / 23.07
* Update Dockerfile.win10.min
* Fix formatting issue
* fix formatting issue
* Fix whitespaces
* Fix whitespaces
* Fix whitespaces
* Improve asyncio testing (#6122)
* Reduce instance count to 1 for python bls model loading test (#6130)
* Reduce instance count to 1 for python bls model loading test
* Add comment when calling unload
* Fix queue test to expect exact number of failures (#6133)
* Fix queue test to expect exact number of failures
* Increase the execution time to more accurately capture requests
* Add CPU & GPU metrics in Grafana dashboard.json for K8s op prem deployment (fix #6047) (#6100)
Signed-off-by: Xiaodong Ye
* Adding the support tracing of child models invoked from a BLS model (#6063)
* Adding tests for bls
* Added fixme, cleaned previous commit
* Removed unused imports
* Fixing commit tree: Refactor code, so that OTel tracer provider is initialized only once
Added resource cmd option, testing
Added docs
* Clean up
* Update docs/user_guide/trace.md
Co-authored-by: Ryan McCormick
* Revision
* Update doc
* Clean up
* Added ostream exporter to OpenTelemetry for testing purposes; refactored trace tests
* Added opentelemetry trace collector set up to tests; refactored otel exporter tests to use OTel collector instead of netcat
* Revising according to comments
* Added comment regarding 'parent_span_id'
* Added permalink
* Adjusted test
---------
Co-authored-by: Ryan McCormick
* Test python environments 3.8-3.11 (#6109)
Add tests for python 3.8-3.11 for L0_python_backends
* Improve L0_backend_python debugging (#6157)
* Improve L0_backend_python debugging
* Use utils function for artifacts collection
* Add unreachable output test for reporting source of disconnectivity (#6149)
* Update 'main' to track development of 2.38.0 / 23.09 (#6163)
* Fix the versions in the doc (#6164)
* Update docs with NVAIE messaging (#6162)
Update docs with NVAIE messaging
* Add sanity tests for parallel instance loading (#6126)
* Remove extra whitespace (#6174)
* Remove a test case that sanity checks input value of --shape CLI flag (#6140)
* Remove test checking for --shape option
* Remove the entire test
* Add test when unload/load requests for same model is received at the same time (#6150)
* Add test when unload/load requests for same model received the same time
* Add test_same_model_overlapping_load_unload
* Use a load/unload stress test instead
* Pre-merge test name update
* Address pre-commit error
* Revert "Address pre-commit error"
This reverts commit 781cab1bfe816a3ffd5eaf23b01a7bfa38314bcd.
* Record number of occurrences of each exception
* Make assert failures clearer in L0_trt_plugin (#6166)
* Add end-to-end CI test for decoupled model support (#6131) (#6184)
* Add end-to-end CI test for decoupled model support
* Address feedback
* Test preserve_ordering for oldest strategy sequence batcher (#6185)
* added debugging guide (#5924)
* added debugging guide
* Run pre-commit
---------
Co-authored-by: David Yastremsky
* Add deadlock gdb section to debug guide (#6193)
* Fix character escape in model repository documentation (#6197)
* Fix docs test (#6192)
* Add utility functions for array manipulation (#6203)
* Add utility functions for outlier removal
* Fix functions
* Add newline to end of file
* Add gc collect to make sure gpu tensor is deallocated (#6205)
* Testing: add gc collect to make sure gpu tensor is deallocated
* Address comment
* Check for log error on failing to find explicit load model (#6204)
* Set default shm size to 1MB for Python backend (#6209)
* Trace Model Name Validation (#6199)
* Initial commit
* Cleanup using new standard formatting
* QA test restructuring
* Add newline to the end of test.sh
* HTTP/GRPC protocol changed to pivot on ready status & error status. Log file name changed in qa test.
* Fixing unhandled error memory leak
* Handle index function memory leak fix
* Fix the check for error message (#6226)
* Fix copyright for debugging guide (#6225)
* Add watts units to GPU power metric descriptions (#6242)
* Update post-23.08 release (#6234)
* CUDA 12.1 > 12.2
* DLIS-5208: onnxruntime+windows - stop treat warnings on compile as errors
* Revert "DLIS-5208: onnxruntime+windows - stop treat warnings on compile as errors"
This reverts commit 0cecbb7461fd944ff09f456011dfab960dff170e.
* Update Dockerfile.win10.min * Update Dockerfile.win10.min * Update README and versions for 23.08 branch * Update Dockerfile.win10 * Fix the versions in docs * Add the note about stabilization of the branch * Update docs with NVAIE messaging (#6162) (#6167) Update docs with NVAIE messaging Co-authored-by: David Zier <42390249+dzier@users.noreply.github.com> * Resolve merge conflict --------- Co-authored-by: tanmayv25 Co-authored-by: David Zier <42390249+dzier@users.noreply.github.com> * Add tests/docs for queue size (pending request count) metric (#6233) * Adding safe string to number conversions (#6173) * Added catch for out of range error for trace setting update * Added wrapper to safe parse options * Added option names to errors * Adjustments * Quick fix * Fixing option name for Windows * Removed repetitive code * Adjust getopt_long for Windows to use longindex * Moved try catch into ParseOption * Removed unused input * Improved names * Refactoring and clean up * Fixed Windows * Refactored getopt_long for Windows * Refactored trace test, pinned otel's collector version to avoid problems with go requirements * Test Python execute() to return Triton error code (#6228) * Add test for Python execute error code * Add all supported error codes into test * Move ErrorCode into TritonError * Expose ErrorCode internal in TritonError * Add docs on IPv6 (#6262) * Add test for TensorRT version-compatible model support (#6255) * Add tensorrt version-compatibility test * Generate one version-compatible model * Fix copyright year * Remove unnecessary variable * Remove unnecessary line * Generate TRT version-compatible model * Add sample inference to TRT version-compatible test * Clean up utils and model gen for new plan model * Fix startswith capitalization * Remove unused imports * Remove unused imports * Add log check * Upgrade protobuf version (#6268) * Add testing for retrieving shape and datatype in backend API (#6231) Add testing for retrieving output shape and datatype info from backend API * Update 'main' to track development of 2.39.0 / 23.10 (#6277) * Apply UCX workaround (#6254) * Add ensemble parameter forwarding test (#6284) * Exclude extra TRT version-compatible models from tests (#6294) * Exclude compatible models from tests. * Force model removal, in case it does not exist Co-authored-by: Ryan McCormick --------- Co-authored-by: Ryan McCormick * Adding installation of docker and docker-buildx (#6299) * Adding installation of docker and docker-buildx * remove whitespace * Use targetmodel from header as model name in SageMaker (#6147) * Use targetmodel from header as model name in SageMaker * Update naming for model hash * Add more error messages, return codes, and refactor HTTP server (#6297) * Fix typo (#6318) * Update the request re-use example (#6283) * Update the request re-use example * Review edit * Review comment * Disable developer tools build for In-process API + JavaCPP tests (#6296) * Add Python binding build. 
Add L0_python_api to test Python binding (#6319) * Add L0_python_api to test Python binding * Install Python API in CI image * Fix QA build * Increase network timeout for valgrind (#6324) * Tests and docs for ability to specify subdirectory to download for LocalizePath (#6308) * Added custom localization tests for s3 and azure, added docs * Refactor HandleInfer into more readable chunks (#6332) * Refactor model generation scripts (#6336) * Refactor model generation scripts * Fix codeql * Fix relative path import * Fix package structure * Copy the gen_common file * Add missing uint8 * Remove duplicate import * Add testing for scalar I/O in ORT backend (#6343) * Add testing for scalar I/O in ORT backend * Review edit * ci * Update post-23.09 release (#6367) * Update README and versions for 23.09 branch (#6280) * Update `Dockerfile` and `build.py` (#6281) * Update configuration for Windows Dockerfile (#6256) * Adding installation of docker and docker-buildx * Enable '--expt-relaxed-constexpr' flag for custom ops models * Upate Dockerfile version * Disable unit tests for Jetson * Update condition (#6285) * removing Whitespaces (#6293) * removing Whitespaces * removing whitespaces * Add security policy (#6376) * Adding client-side request cancellation support and testing (#6383) * Add L0_request_cancellation (#6252) * Add L0_request_cancellation * Remove unittest test * Add cancellation to gRPC server error handling * Fix up * Use identity model * Add tests for gRPC client-side cancellation (#6278) * Add tests for gRPC client-side cancellation * Fix CodeQL issues * Formatting * Update qa/L0_client_cancellation/client_cancellation_test.py Co-authored-by: Ryan McCormick * Move to L0_request_cancellation * Address review comments * Removing request cancellation support from asyncio version * Format * Update copyright * Remove tests * Handle cancellation notification in gRPC server (#6298) * Handle cancellation notification in gRPC server * Fix the request ptr initialization * Update src/grpc/infer_handler.h Co-authored-by: Ryan McCormick * Address review comment * Fix logs * Fix request complete callback by removing reference to state * Improve documentation --------- Co-authored-by: Ryan McCormick --------- Co-authored-by: Ryan McCormick * Fixes on the gRPC frontend to handle AsyncNotifyWhenDone() API (#6345) * Fix segmentation fault in gRPC frontend * Finalize all states upon completion * Fixes all state cleanups * Handle completed states when cancellation notification is received * Add more documentation steps * Retrieve dormant states to minimize the memory footprint for long streams * Update src/grpc/grpc_utils.h Co-authored-by: Ryan McCormick * Use a boolean state instead of raw pointer --------- Co-authored-by: Ryan McCormick * Add L0_grpc_state_cleanup test (#6353) * Add L0_grpc_state_cleanup test * Add model file in QA container * Fix spelling * Add remaining subtests * Add failing subtests * Format fixes * Fix model repo * Fix QA docker file * Remove checks for the error message when shutting down server * Fix spelling * Address review comments * Add schedulers request cancellation tests (#6309) * Add schedulers request cancellation tests * Merge gRPC client test * Reduce testing time and covers cancelling other requests as a consequence of request cancellation * Add streaming request cancellation test --------- Co-authored-by: Iman Tabrizian Co-authored-by: Ryan McCormick Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> * Add missing copyright (#6388) * Add basic generate 
endpoints for LLM tasks (#6366) * PoC of parsing request prompt and converting to Triton infer request * Remove extra trace * Add generate endpoint * Enable streaming version * Fix bug * Fix up * Add basic testing. Cherry pick from #6369 * format * Address comment. Fix build * Minor cleanup * cleanup syntax * Wrap error in SSE format * Fix up * Restrict number of response on non-streaming generate * Address comment on implementation. * Re-enable trace on generate endpoint * Add more comprehensive llm endpoint tests (#6377) * Add security policy (#6376) * Start adding some more comprehensive tests * Fix test case * Add response error testing * Complete test placeholder * Address comment * Address comments * Fix code check --------- Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Co-authored-by: GuanLuo * Address comment * Address comment * Address comment * Fix typo --------- Co-authored-by: Ryan McCormick Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> * Add Python backend request cancellation test (#6364) * Add cancelled response status test * Add Python backend request cancellation test * Add Python backend decoupled request cancellation test * Simplified response if cancelled * Test response_sender.send() after closed * Rollback test response_sender.send() after closed * Rollback non-decoupled any response on cancel * Add TRT-LLM backend build to Triton (#6365) (#6392) * Add TRT-LLM backend build to Triton (#6365) * Add trtllm backend to build * Temporarily adding version map for 23.07 * Fix build issue * Update comment * Comment out python binding changes * Add post build * Update trtllm backend naming * Update TRTLLM base image * Fix cmake arch * Revert temp changes for python binding PR * Address comment * Move import to the top (#6395) * Move import to the top * pre commit format * Add Python backend when vLLM backend built (#6397) * Update build.py to build vLLM backend (#6394) * Support parameters object in generate route * Update 'main' to track development of 2.40.0 / 23.11 (#6400) * Fix L0_sdk (#6387) * Add documentation on request cancellation (#6403) * Add documentation on request cancellation * Include python backend * Update docs/user_guide/request_cancellation.md Co-authored-by: Iman Tabrizian * Update docs/user_guide/request_cancellation.md Co-authored-by: Neelay Shah * Update docs/README.md Co-authored-by: Neelay Shah * Update docs/user_guide/request_cancellation.md Co-authored-by: Ryan McCormick * Remove inflight term from the main documentation * Address review comments * Fix * Update docs/user_guide/request_cancellation.md Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> * Fix --------- Co-authored-by: Iman Tabrizian Co-authored-by: Neelay Shah Co-authored-by: Ryan McCormick Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> * Fixes in request cancellation doc (#6409) * Document generate HTTP endpoint (#6412) * Document generate HTTP endpoint * Address comment * Fix up * format * Address comment * Update SECURITY.md to not display commented copyright (#6426) * Fix missing library in L0_data_compression (#6424) * Fix missing library in L0_data_compression * Fix up * Add Javacpp-presets repo location as env variable in Java tests(#6385) Simplify testing when upstream (javacpp-presets) build changes. 
Related to triton-inference-server/client#409 * TRT-LLM backend build changes (#6406) * Update url * Debugging * Debugging * Update url * Fix build for TRT-LLM backend * Remove TRTLLM TRT and CUDA versions * Fix up unused var * Fix up dir name * FIx cmake patch * Remove previous TRT version * Install required packages for example models * Remove packages that are only needed for testing * Add gRPC AsyncIO request cancellation tests (#6408) * Fix gRPC test failure and refactor * Add gRPC AsyncIO cancellation tests * Better check if a request is cancelled * Use f-string * Fix L0_implicit_state (#6427) * Fixing vllm build (#6433) * Fixing torch version for vllm * Switch Jetson model TensorRT models generation to container (#6378) * Switch Jetson model TensorRT models generation to container * Adding missed file * Fix typo * Fix typos * Remove extra spaces * Fix typo * Bumped vllm version (#6444) * Adjust test_concurrent_same_model_load_unload_stress (#6436) * Adding emergency vllm latest release (#6454) * Fix notify state destruction and inflight states tracking (#6451) * Ensure notify_state_ gets properly destructed * Fix inflight state tracking to properly erase states * Prevent removing the notify_state from being erased * Wrap notify_state_ object within unique_ptr * Update TRT-LLM backend url (#6455) * TRTLLM backend post release * TRTLLM backend post release * Update submodule url for permission issue * Update submodule url * Fix up * Not using postbuild function to workaround submodule url permission issue * Added docs on python based backends (#6429) Co-authored-by: Neelay Shah * L0_model_config Fix (#6472) * Minor fix for L0_model_config * Add test for Python model parameters (#6452) * Test Python BLS with different sizes of CUDA memory pool (#6276) * Test with different sizes of CUDA memory pool * Check the server log for error message * Improve debugging * Fix syntax * Add documentation for K8s-onprem StartupProbe (#5257) Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Co-authored-by: Ryan McCormick * Update `main` post-23.10 release (#6484) * Update README and versions for 23.10 branch (#6399) * Cherry-picking vLLM backend changes (#6404) * Update build.py to build vLLM backend (#6394) * Add Python backend when vLLM backend built (#6397) --------- Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> * Add documentation on request cancellation (#6403) (#6407) * Add documentation on request cancellation * Include python backend * Update docs/user_guide/request_cancellation.md * Update docs/user_guide/request_cancellation.md * Update docs/README.md * Update docs/user_guide/request_cancellation.md * Remove inflight term from the main documentation * Address review comments * Fix * Update docs/user_guide/request_cancellation.md * Fix --------- Co-authored-by: Iman Tabrizian Co-authored-by: Neelay Shah Co-authored-by: Ryan McCormick Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> * Fixes in request cancellation doc (#6409) (#6410) * TRT-LLM backend build changes (#6406) (#6430) * Update url * Debugging * Debugging * Update url * Fix build for TRT-LLM backend * Remove TRTLLM TRT and CUDA versions * Fix up unused var * Fix up dir name * FIx cmake patch * Remove previous TRT version * Install required packages for example models * Remove packages that are only needed for testing * Fixing vllm build (#6433) (#6437) * Fixing torch version for vllm Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> 
* Update TRT-LLM backend url (#6455) (#6460) * TRTLLM backend post release * TRTLLM backend post release * Update submodule url for permission issue * Update submodule url * Fix up * Not using postbuild function to workaround submodule url permission issue * remove redundant lines * Revert "remove redundant lines" This reverts commit 86be7ad969b484e5b55a3c0541d21eee7a06d889. * restore missed lines * Update build.py Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> * Update build.py Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> --------- Co-authored-by: Tanmay Verma Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Co-authored-by: Iman Tabrizian Co-authored-by: Neelay Shah Co-authored-by: Ryan McCormick Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> Co-authored-by: Kris Hung Co-authored-by: Katherine Yang <80359429+jbkyang-nvi@users.noreply.github.com> Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> * Adding structure reference to the new document (#6493) * Improve L0_backend_python test stability (ensemble / gpu_tensor_lifecycle) (#6490) * Test torch allocator gpu memory usage directly rather than global gpu memory for more consistency * Add L0_generative_sequence test (#6475) * Add testing backend and test * Add test to build / CI. Minor fix on L0_http * Format. Update backend documentation * Fix up * Address comment * Add negative testing * Fix up * Downgrade vcpkg version (#6503) * Collecting sub dir artifacts in GitLab yaml. Removing collect function from test script. (#6499) * Use post build function for TRT-LLM backend (#6476) * Use postbuild function * Remove updating submodule url * Enhanced python_backend autocomplete (#6504) * Added testing for python_backend autocomplete: optional input and model_transaction_policy * Parse reuse-grpc-port and reuse-http-port as booleans (#6511) Co-authored-by: Francesco Petrini * Fixing L0_io (#6510) * Fixing L0_io * Add Python-based backends CI (#6466) * Bumped vllm version * Add python-bsed backends testing * Add python-based backends CI * Fix errors * Add vllm backend * Fix pre-commit * Modify test.sh * Remove vllm_opt qa model * Remove vLLM ackend tests * Resolve review comments * Fix pre-commit errors * Update qa/L0_backend_python/python_based_backends/python_based_backends_test.py Co-authored-by: Tanmay Verma * Remove collect_artifacts_from_subdir function call --------- Co-authored-by: oandreeva-nv Co-authored-by: Tanmay Verma * Enabling option to restrict access to HTTP APIs based on header value pairs (similar to gRPC) * Upgrade DCGM from 2.4.7 to 3.2.6 (#6515) * Enhance GCS credentials documentations (#6526) * Test file override outside of model directory (#6516) * Add boost-filesystem * Update ORT version to 1.16.2 (#6531) * Adjusting expected error msg (#6517) * Update 'main' to track development of 2.41.0 / 23.12 (#6543) * Enhance testing for pending request count (#6532) * Enhance testing for pending request count * Improve the documentation * Add more documentation * Add testing for Python backend request rescheduling (#6509) * Add testing * Fix up * Enhance testing * Fix up * Revert test changes * Add grpc endpoint test * Remove unused import * Remove unused import * Update qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py Co-authored-by: Iman Tabrizian * Update qa/python_models/bls_request_rescheduling/model.py Co-authored-by: Iman Tabrizian --------- Co-authored-by: Iman 
Tabrizian * Check that the wget is installed (#6556) * secure deployment considerations guide (#6533) * draft document * updates * updates * updated * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * update * updates * updates * Update docs/customization_guide/deploy.md Co-authored-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com> * Update docs/customization_guide/deploy.md Co-authored-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com> * fixing typos * updated with clearer warnings * updates to readme and toc --------- Co-authored-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com> * Fix typo and change the command line order (#6557) * Fix typo and change the command line order * Improve visual experience. Add 'clang' package * Add error during rescheduling test to L0_generative_sequence (#6550) * changing references to concrete instances * Add testing for implicit state enhancements (#6524) * Add testing for single buffer * Add testing for implicit state with buffer growth * Improve testing * Fix up * Add CUDA virtual address size flag * Add missing test files * Parameter rename * Test fixes * Only build implicit state backend for GPU=ON * Fix copyright (#6584) * Mention TRT LLM backend supports request cancellation (#6585) * update model repository generation for onnx models for protobuf (#6575) * Fix L0_sagemaker (#6587) * Add C++ server wrapper to the doc (#6592) * Add timeout to client apis and tests (#6546) Client PR: triton-inference-server/client#429 * Change name generative -> iterative (#6601) * name changes * updated names * Add documentation on generative sequence (#6595) * Add documentation on generative sequence * Address comment * Reflect the "iterative" change * Updated description of iterative sequences * Restricted HTTP API documentation Co-authored-by: Ryan McCormick * Add request cancellation and debugging guide to generated docs (#6617) * Support for http request cancellation. Includes fix for seg fault in generate_stream endpoint. * Bumped vLLM version to v0.2.2 (#6623) * Upgrade ORT version (#6618) * Use compliant preprocessor (#6626) * Update README.md (#6627) * Extend request objects lifetime and fixes possible segmentation fault (#6620) * Extend request objects lifetime * Remove explicit TRITONSERVER_InferenceRequestDelete * Format fix * Include the inference_request_ initialization to cover RequestNew --------- Co-authored-by: Neelay Shah * Update protobuf after python update for testing (#6638) This fixes the issue where python client has `AttributeError: 'NoneType' object has no attribute 'enum_types_by_name' errors after python version is updated. * Update post-23.11 release (#6653) * Update README and versions for 2.40.0 / 23.11 (#6544) * Removing path construction to use SymLink alternatives * Update version for PyTorch * Update windows Dockerfile configuration * Update triton version to 23.11 * Update README and versions for 2.40.0 / 23.11 * Fix typo * Ading 'ldconfig' to configure dynamic linking in container (#6602) * Point to tekit_backend (#6616) * Point to tekit_backend * Update version * Revert tekit changes (#6640) --------- Co-authored-by: Kris Hung * PYBE Timeout Tests (#6483) * New testing to confirm large request timeout values can be passed and retrieved within Python BLS models. 
* Add note on lack of ensemble support (#6648) * Added request id to span attributes (#6667) * Add test for optional internal tensor within an ensemble (#6663) * Add test for optional internal tensor within an ensemble * Fix up * Set CMake version to 3.27.7 (#6675) * Set CMake version to 3.27.7 * Set CMake version to 3.27.7 * Fix double slash typo * restore typo (#6680) * Update 'main' to track development of 2.42.0 / 24.01 (#6673) * iGPU build refactor (#6684) (#6691) * Mlflow Plugin Fix (#6685) * Mlflow plugin fix * Fix extra content-type headers in HTTP server (#6678) * Fix iGPU CMakeFile tags (#6695) * Unify iGPU test build with x86 ARM * adding TRITON_IGPU_BUILD to core build definition; adding logic to skip caffe2plan test if TRITON_IGPU_BUILD=1 * re-organizing some copies in Dockerfile.QA to fix igpu devel build * Pre-commit fix --------- Co-authored-by: kyle * adding default value for TRITON_IGPU_BUILD=OFF (#6705) * adding default value for TRITON_IGPU_BUILD=OFF * fix newline --------- Co-authored-by: kyle * Add test case for decoupled model raising exception (#6686) * Add test case for decoupled model raising exception * Remove unused import * Address comment * Escape special characters in general docs (#6697) * vLLM Benchmarking Test (#6631) * vLLM Benchmarking Test * Allow configuring GRPC max connection age and max connection age grace (#6639) * Add ability to configure GRPC max connection age and max connection age grace * Allow pass GRPC connection age args when they are set from command ---------- Co-authored-by: Katherine Yang <80359429+jbkyang-nvi@users.noreply.github.com> --------- Signed-off-by: Xiaodong Ye Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> Co-authored-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com> Co-authored-by: Neelay Shah Co-authored-by: Tanmay Verma Co-authored-by: Kris Hung Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> Co-authored-by: Ryan McCormick Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Co-authored-by: Katherine Yang <80359429+jbkyang-nvi@users.noreply.github.com> Co-authored-by: Iman Tabrizian Co-authored-by: Gerard Casas Saez Co-authored-by: Misha Chornyi <99709299+mc-nv@users.noreply.github.com> Co-authored-by: R0CKSTAR Co-authored-by: Elias Bermudez <6505145+debermudez@users.noreply.github.com> Co-authored-by: ax-vivien <113907557+ax-vivien@users.noreply.github.com> Co-authored-by: Neelay Shah Co-authored-by: nv-kmcgill53 <101670481+nv-kmcgill53@users.noreply.github.com> Co-authored-by: Matthew Kotila Co-authored-by: Nikhil Kulkarni Co-authored-by: Misha Chornyi Co-authored-by: Iman Tabrizian Co-authored-by: David Yastremsky Co-authored-by: Timothy Gerdes <50968584+tgerdesnv@users.noreply.github.com> Co-authored-by: Mate Mijolović Co-authored-by: David Zier <42390249+dzier@users.noreply.github.com> Co-authored-by: Hyunjae Woo <107147848+nv-hwoo@users.noreply.github.com> Co-authored-by: Tanay Varshney Co-authored-by: Francesco Petrini Co-authored-by: Dmitry Mironov Co-authored-by: Ryan McCormick Co-authored-by: Sai Kiran Polisetty Co-authored-by: oandreeva-nv Co-authored-by: kyle Co-authored-by: Neal Vaidya Co-authored-by: siweili11 <152239970+siweili11@users.noreply.github.com> --- .clang-format | 4 +- .github/workflows/codeql.yml | 84 + .github/workflows/pre-commit.yaml | 39 + .gitignore | 5 + .pre-commit-config.yaml | 74 + CITATION.cff | 7 + CMakeLists.txt | 67 +- CONTRIBUTING.md | 34 +- Dockerfile.QA | 92 +- Dockerfile.sdk | 53 +- 
Dockerfile.win10.min | 161 +- LICENSE | 2 +- README.md | 178 +- SECURITY.md | 44 + TRITON_VERSION | 2 +- build.py | 2598 +++++---- compose.py | 437 +- deploy/alibaba-cloud/README.md | 10 +- deploy/aws/README.md | 32 +- deploy/aws/templates/deployment.yaml | 6 +- deploy/aws/values.yaml | 4 +- deploy/fleetcommand/Chart.yaml | 13 +- deploy/fleetcommand/README.md | 10 +- deploy/fleetcommand/templates/deployment.yaml | 2 + deploy/fleetcommand/templates/secrets.yaml | 2 + deploy/fleetcommand/values.yaml | 8 +- deploy/gcp/README.md | 28 +- deploy/gcp/values.yaml | 4 +- deploy/gke-marketplace-app/README.md | 93 +- .../gke-marketplace-app/benchmark/README.md | 17 +- .../model-store/bert_base_tf_gpu/config.pbtxt | 4 +- .../bert_base_trt_gpu/config.pbtxt | 4 +- .../bert_distill_tf_cpu/config.pbtxt | 4 +- .../bert_distill_tf_gpu/config.pbtxt | 4 +- .../perf-analyzer-script/perf_query.sh | 0 .../perf-analyzer-script/triton_client.yaml | 4 +- .../client-sample/bert_request.json | 6 +- ...tfile_bert_large.py => locustfile_bert.py} | 17 +- .../client-sample/perf_analyzer_grpc.sh | 2 +- .../server-deployer/build_and_push.sh | 9 +- .../server-deployer/chart/triton/Chart.yaml | 6 +- .../chart/triton/templates/application.yaml | 16 +- .../chart/triton/templates/deployment.yaml | 4 +- .../chart/triton/templates/hpa.yaml | 10 +- .../chart/triton/templates/ingress.yaml | 48 + .../chart/triton/templates/service.yaml | 6 +- .../server-deployer/chart/triton/values.yaml | 10 +- .../server-deployer/data-test/schema.yaml | 26 +- .../server-deployer/schema.yaml | 26 +- .../gke-marketplace-app/trt-engine/README.md | 63 + deploy/k8s-onprem/README.md | 38 +- deploy/k8s-onprem/dashboard.json | 690 ++- deploy/k8s-onprem/templates/deployment.yaml | 13 + deploy/k8s-onprem/values.yaml | 2 +- deploy/mlflow-triton-plugin/README.md | 13 +- .../onnx_float32_int32_int32/config.pbtxt | 0 .../mlflow_triton/__init__.py | 6 +- .../mlflow_triton/config.py | 115 +- .../mlflow_triton/deployments.py | 359 +- .../scripts/publish_model_to_mlflow.py | 22 +- .../scripts/triton_flavor.py | 16 +- deploy/mlflow-triton-plugin/setup.py | 10 +- docker/cpu_only/entrypoint.d/12-banner.sh | 0 .../entrypoint.d/50-gpu-driver-check2.sh | 0 .../entrypoint.d/15-container-copyright.txt | 2 +- docker/entrypoint.d/50-gpu-driver-check2.sh | 0 docker/entrypoint.d/99-check-run-aip-mode.sh | 0 docker/sagemaker/serve | 71 +- docs/.gitignore | 1 + docs/Dockerfile.docs | 54 + docs/Makefile | 53 + docs/README.md | 277 +- docs/_static/.gitattributes | 2 + docs/_static/custom.css | 319 ++ docs/_static/logo_2color_horizontal.svg | 2 + docs/_static/logo_2color_vertical.svg | 2 + .../nvidia-logo-horiz-rgb-blk-for-screen.png | 3 + .../nvidia-logo-vert-rgb-blk-for-screen.png | 3 + docs/_static/rtd-data.js | 36 + docs/_templates/layout.html | 31 + docs/conf.py | 256 + docs/contents.md | 104 + docs/{ => customization_guide}/build.md | 70 +- docs/{ => customization_guide}/compose.md | 18 +- docs/customization_guide/deploy.md | 279 + .../inference_protocols.md | 187 +- .../repository_agents.md | 2 +- docs/{ => customization_guide}/test.md | 6 +- docs/examples/README.md | 35 + docs/examples/jetson/README.md | 14 +- .../concurrency_and_dynamic_batching/Makefile | 6 +- .../README.md | 24 +- .../concurrency_and_dynamic_batching/common.h | 1 + .../people_detection.cc | 7 +- .../tao/convert_peoplenet.sh | 0 .../simple_identity/config.pbtxt | 0 docs/{ => getting_started}/quickstart.md | 42 +- docs/index.md | 106 + docs/metrics.md | 143 - docs/perf_analyzer.md | 667 --- 
docs/protocol/README.md | 63 +- docs/protocol/extension_binary_data.md | 4 +- docs/protocol/extension_classification.md | 6 +- docs/protocol/extension_generate.md | 188 + docs/protocol/extension_logging.md | 198 + .../protocol/extension_model_configuration.md | 12 +- docs/protocol/extension_model_repository.md | 54 +- docs/protocol/extension_parameters.md | 104 + docs/protocol/extension_schedule_policy.md | 33 +- docs/protocol/extension_sequence.md | 6 +- docs/protocol/extension_shared_memory.md | 35 +- docs/protocol/extension_statistics.md | 74 +- docs/protocol/extension_trace.md | 36 +- docs/response_cache.md | 87 - docs/trace.md | 305 -- docs/{ => user_guide}/architecture.md | 25 +- docs/{ => user_guide}/custom_operations.md | 47 +- docs/user_guide/debugging_guide.md | 151 + docs/{ => user_guide}/decoupled_models.md | 45 +- docs/{ => user_guide}/faq.md | 72 +- docs/{ => user_guide}/images/arch.jpg | Bin .../images/dyna_sequence_example0.png | Bin .../images/dyna_sequence_example1.png | Bin .../images/ensemble_example0.png | Bin .../images/multi_model_exec.png | Bin .../images/multi_model_parallel_exec.png | Bin .../images/multi_model_serial_exec.png | Bin .../images/sequence_example0.png | Bin .../images/sequence_example1.png | Bin .../images/sequence_example2.png | Bin .../images/triton_on_jetson.png | Bin docs/{ => user_guide}/jetson.md | 44 +- docs/user_guide/metrics.md | 345 ++ docs/user_guide/model_analyzer.md | 45 + docs/{ => user_guide}/model_configuration.md | 287 +- docs/{ => user_guide}/model_management.md | 91 +- docs/{ => user_guide}/model_repository.md | 152 +- docs/{ => user_guide}/optimization.md | 71 +- docs/user_guide/perf_analyzer.md | 30 + docs/user_guide/performance_tuning.md | 393 ++ docs/{ => user_guide}/ragged_batching.md | 3 +- docs/{ => user_guide}/rate_limiter.md | 4 +- docs/user_guide/request_cancellation.md | 102 + docs/user_guide/response_cache.md | 243 + docs/user_guide/trace.md | 539 ++ docs/{ => user_guide}/v1_to_v2.md | 4 +- pyproject.toml | 51 + qa/L0_async_work_queue/test.sh | 0 qa/L0_backend_bls/test.sh | 15 +- qa/L0_backend_config/test.sh | 127 +- qa/L0_backend_fastertransformer/test.sh | 83 + qa/L0_backend_identity/identity_test.py | 235 +- qa/L0_backend_identity/test.sh | 2 +- qa/L0_backend_output_detail/test.sh | 69 + .../models/argument_validation/1/model.py | 110 +- .../argument_validation/test.sh | 7 +- .../bls/bls_parameters_test.py | 71 + qa/L0_backend_python/bls/test.sh | 346 +- qa/L0_backend_python/common.sh | 34 +- qa/L0_backend_python/custom_metrics/test.sh | 85 + .../decoupled/decoupled_test.py | 215 +- .../decoupled/models/decoupled_bls/1/model.py | 175 +- .../models/decoupled_bls_stream/1/model.py | 132 + .../models/decoupled_bls_stream/config.pbtxt | 54 + .../models/decoupled_execute_error/1/model.py | 52 +- .../decoupled_raise_exception/1/model.py | 35 + .../decoupled_raise_exception/config.pbtxt | 55 + .../1/model.py | 47 +- .../1/model.py | 46 +- qa/L0_backend_python/decoupled/test.sh | 51 +- .../ensemble/ensemble_test.py | 68 +- qa/L0_backend_python/ensemble/test.sh | 13 +- qa/L0_backend_python/env/test.sh | 198 +- qa/L0_backend_python/examples/test.sh | 205 +- qa/L0_backend_python/io/io_test.py | 140 +- qa/L0_backend_python/io/test.sh | 82 +- .../lifecycle/lifecycle_test.py | 156 +- qa/L0_backend_python/lifecycle/test.sh | 22 +- qa/L0_backend_python/logging/logging_test.py | 58 + qa/L0_backend_python/logging/test.sh | 231 + .../model_control/model_control_test.py | 23 +- qa/L0_backend_python/model_control/test.sh | 8 +- 
.../python_based_backends_test.py | 144 + .../python_based_backends/test.sh | 113 + qa/L0_backend_python/python_test.py | 388 +- qa/L0_backend_python/python_unittest.py | 55 +- .../grpc_endpoint_test.py | 111 + .../request_rescheduling/test.sh | 116 + .../restart/models/restart/1/model.py | 21 +- qa/L0_backend_python/restart/restart_test.py | 24 +- qa/L0_backend_python/restart/test.sh | 7 +- .../setup_python_enviroment.sh | 171 + qa/L0_backend_python/test.sh | 171 +- qa/L0_backend_python/variants/test.sh | 2 +- qa/L0_backend_tutorial/test.sh | 23 +- qa/L0_batch_custom/batch_custom_test.py | 273 + qa/L0_batch_custom/test.sh | 192 + qa/L0_batch_input/batch_input_test.py | 272 +- qa/L0_batch_input/test.sh | 8 +- qa/L0_batcher/batcher_test.py | 1362 +++-- qa/L0_batcher/test.sh | 92 +- qa/L0_batcher/verify_timestamps.py | 45 +- .../buffer_attributes_test.py | 65 +- qa/L0_buffer_attributes/models/bls/1/model.py | 19 +- .../models/identity/1/model.py | 7 +- qa/L0_buffer_attributes/test.sh | 3 +- qa/L0_client_build_variants/test.sh | 37 +- qa/L0_client_java/test.sh | 0 .../client_memory_mail.py | 13 +- .../models/custom_identity_int32/config.pbtxt | 2 +- qa/L0_client_memory_growth/test.sh | 46 +- qa/L0_client_nobatch/client_test.py | 205 +- ...t_test.py => client_infer_timeout_test.py} | 177 +- .../client_non_infer_timeout_test.py | 340 ++ .../models/custom_identity_int32/config.pbtxt | 2 +- qa/L0_client_timeout/test.sh | 80 +- .../models/custom_identity_int32/config.pbtxt | 2 +- qa/L0_client_valgrind/test.sh | 4 +- qa/L0_cmdline_trace/test.sh | 134 +- qa/L0_cmdline_trace/trace_client.py | 79 + qa/L0_config_json/max_priority_level.pbtxt | 62 + qa/L0_config_json/test.sh | 48 +- qa/L0_cuda_graph/test.sh | 51 +- qa/L0_cuda_graph/trt_cuda_graph_test.py | 85 +- .../cuda_shared_memory_test.py | 138 +- qa/L0_cuda_shared_memory/test.sh | 2 +- qa/L0_custom_ops/cuda_op_test.py | 66 +- qa/L0_custom_ops/mod_op_test.py | 77 +- qa/L0_custom_ops/onnx_op_test.py | 74 +- qa/L0_custom_ops/test.sh | 66 +- qa/L0_custom_ops/vision_op_test.py | 74 +- qa/L0_custom_ops/zero_out_test.py | 64 +- qa/L0_data_compression/test.sh | 7 +- qa/L0_data_compression/validation.py | 12 +- qa/L0_decoupled/decoupled_test.py | 530 +- qa/L0_decoupled/test.sh | 16 +- qa/L0_device_memory_tracker/test.py | 109 + qa/L0_device_memory_tracker/test.sh | 128 + .../unittest => L0_dlpack_multi_gpu}/test.sh | 21 +- qa/L0_doc_links/mkdocs.yml | 44 + qa/L0_doc_links/test.sh | 76 + qa/L0_dyna_implicit_state/test.sh | 15 +- .../dyna_sequence_batcher_test.py | 1016 ++-- qa/L0_dyna_sequence_batcher/test.sh | 16 +- .../client_plugin_test/1/model.py | 63 + .../client_plugin_test/config.pbtxt | 33 +- qa/L0_grpc/grpc_basic_auth_test.py | 66 + qa/L0_grpc/grpc_client_plugin_test.py | 120 + qa/L0_grpc/nginx.conf | 54 + qa/L0_grpc/python_grpc_aio_test.py | 125 + qa/L0_grpc/python_unit_test.py | 159 + qa/L0_grpc/test.sh | 175 +- qa/L0_grpc_state_cleanup/cleanup_test.py | 560 ++ qa/L0_grpc_state_cleanup/test.sh | 194 + qa/L0_http/generate_endpoint_test.py | 419 ++ .../generate_models/mock_llm/1/model.py | 107 + .../generate_models/mock_llm/config.pbtxt | 60 + qa/L0_http/http_basic_auth_test.py | 66 + qa/L0_http/http_client_plugin_test.py | 175 + qa/L0_http/http_restricted_api_test.py | 94 + qa/L0_http/http_test.py | 125 +- qa/L0_http/nginx.conf | 57 + qa/L0_http/python_http_aio_test.py | 116 + qa/L0_http/test.sh | 189 +- qa/L0_http_fuzz/fuzztest.py | 56 +- qa/L0_http_fuzz/test.sh | 16 +- qa/L0_https/test.sh | 29 +- qa/L0_implicit_state/implicit_state.py | 229 +- 
.../models/growable_memory/config.pbtxt | 103 + .../models/single_state_buffer/config.pbtxt | 97 + qa/L0_implicit_state/test.sh | 26 +- qa/L0_infer/infer_test.py | 1229 +++-- qa/L0_infer/install_and_test.sh | 2 +- qa/L0_infer/test.sh | 182 +- qa/L0_infer_reshape/infer_reshape_test.py | 257 +- qa/L0_infer_reshape/test.sh | 2 +- qa/L0_infer_variable/infer_variable_test.py | 453 +- qa/L0_infer_zero/infer_zero_test.py | 337 +- qa/L0_infer_zero/test.sh | 4 + qa/L0_inferentia_perf_analyzer/test.sh | 34 +- qa/L0_io/test.sh | 63 +- .../iterative_sequence_e2e.py | 192 + .../models/iterative_sequence/config.pbtxt | 48 + qa/L0_iterative_sequence/test.sh | 92 + .../MemoryGrowthTest.java | 1570 +++--- qa/L0_java_memory_growth/test.sh | 16 +- qa/L0_java_resnet/ResnetTest.java | 1039 ++-- qa/L0_java_resnet/test.sh | 12 +- qa/L0_java_sequence_batcher/SequenceTest.java | 1083 ++-- qa/L0_java_sequence_batcher/test.sh | 12 +- qa/L0_java_simple_example/test.sh | 12 +- qa/L0_json/test.sh | 44 + qa/L0_large_payload/large_payload_test.py | 103 +- qa/L0_large_payload/test.sh | 0 qa/L0_libtorch_inference_mode/test.sh | 4 +- .../client.py | 90 + .../gen_models.py | 90 + .../models/libtorch_multi_device/config.pbtxt | 60 + .../test.sh | 149 + qa/L0_libtorch_io_names/io_names_client.py | 52 +- qa/L0_libtorch_io_names/test.sh | 0 qa/L0_libtorch_io_types/test.sh | 131 + qa/L0_libtorch_nvfuser/test.sh | 3 +- qa/L0_libtorch_optimized_execution/test.sh | 0 .../libtorch_shared_weights_test.py | 25 +- qa/L0_libtorch_shared_weights/test.sh | 3 +- qa/L0_lifecycle/lifecycle_test.py | 2899 +++++++---- qa/L0_lifecycle/test.sh | 847 +-- qa/L0_logging/logging_endpoint_test.py | 405 ++ qa/L0_logging/test.sh | 595 +++ qa/L0_long_running_stress/crashing_client.py | 61 +- qa/L0_long_running_stress/scenarios.py | 654 +-- qa/L0_long_running_stress/stress.py | 530 +- qa/L0_long_running_stress/stress_mail.py | 28 +- qa/L0_long_running_stress/test.sh | 26 +- qa/L0_memory/test.sh | 0 qa/L0_memory_growth/busy_op_test.py | 84 +- qa/L0_memory_growth/server_memory_mail.py | 23 +- qa/L0_memory_growth/test.sh | 71 +- qa/L0_metrics/ensemble_delay/config.pbtxt | 67 + qa/L0_metrics/identity_delay/config.pbtxt | 58 + qa/L0_metrics/metrics_config_test.py | 134 + qa/L0_metrics/metrics_queue_size_test.py | 306 ++ qa/L0_metrics/test.sh | 240 +- .../identity_cache_off/config.pbtxt | 46 + .../identity_cache_on/config.pbtxt | 46 + qa/L0_mlflow/plugin_test.py | 55 +- qa/L0_mlflow/test.sh | 109 +- .../common/no_version/expected | 2 +- .../custom/no_delimiter/config.pbtxt | 0 .../custom/no_delimiter/expected | 1 + .../unknown_backend.unknown/config.pbtxt | 0 .../custom/unknown_backend.unknown/expected | 2 + .../invalid_input_map/config.pbtxt | 2 +- .../ensemble/non_existing_model/expected | 2 +- .../unreachable_output_3/config.pbtxt | 94 + .../ensemble/unreachable_output_3/expected | 1 + .../openvino/bad_input_dims/config.pbtxt | 12 + .../openvino/bad_input_dims/expected | 1 + .../openvino/bad_output_dims/config.pbtxt | 12 + .../openvino/bad_output_dims/expected | 1 + .../openvino/too_few_inputs/config.pbtxt | 6 + .../openvino/too_few_inputs/expected | 1 + .../openvino/too_many_inputs/config.pbtxt | 18 + .../openvino/too_many_inputs/expected | 1 + .../openvino/unknown_input/config.pbtxt | 24 + .../openvino/unknown_input/expected | 1 + .../openvino/unknown_output/config.pbtxt | 18 + .../openvino/unknown_output/expected | 1 + .../conflicting_max_batch_size/model.py | 15 +- .../conflicting_scheduler_sequence/model.py | 15 +- 
.../python/input_missing_datatype/model.py | 15 +- .../python/input_missing_dims/model.py | 15 +- .../python/input_missing_name/model.py | 15 +- .../python/input_wrong_property/expected | 2 +- .../python/input_wrong_property/model.py | 20 +- .../config.pbtxt | 24 + .../expected | 1 + .../model.py | 47 + .../config.pbtxt | 28 + .../expected | 1 + .../model.py | 46 + .../python/no_return/model.py | 15 +- .../python/output_missing_datatype/model.py | 15 +- .../python/output_missing_dims/model.py | 15 +- .../python/output_missing_name/model.py | 15 +- .../python/output_wrong_property/model.py | 20 +- .../1/model.savedmodel/saved_model.pb | Bin .../bad_input_dims/config.pbtxt | 0 .../bad_input_dims/expected | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../bad_input_type/config.pbtxt | 0 .../bad_input_type/expected | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../bad_output_dims/config.pbtxt | 0 .../bad_output_dims/expected | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../bad_output_type/config.pbtxt | 0 .../bad_output_type/expected | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../too_many_inputs/config.pbtxt | 2 +- .../too_many_inputs/expected | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../unknown_input/config.pbtxt | 0 .../unknown_input/expected | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../unknown_output/config.pbtxt | 0 .../unknown_output/expected | 1 + .../tensorrt/bad_dynamic_shapes_max/expected | 2 +- .../tensorrt/bad_dynamic_shapes_min/expected | 2 +- .../custom/empty_config.identity/config.pbtxt | 0 .../custom/empty_config.identity/expected | 22 + .../custom/no_backend.identity/config.pbtxt | 15 + .../custom/no_backend.identity/expected | 33 + .../onnx/cpu_instance/config.pbtxt | 0 .../onnx/empty_config/expected | 1 + .../onnx/empty_config/expected.1 | 1 + .../onnx/empty_config/expected.2 | 1 + .../onnx/empty_config/expected.3 | 1 + .../onnx/no_config/expected | 1 + .../onnx/no_config/expected.1 | 1 + .../onnx/no_config/expected.2 | 1 + .../onnx/no_config/expected.3 | 1 + .../openvino/dynamic_batch/config.pbtxt | 0 .../openvino/dynamic_batch/expected | 45 + .../openvino/dynamic_batch/expected.1 | 45 + .../openvino/dynamic_batch/expected.2 | 45 + .../openvino/dynamic_batch/expected.3 | 45 + .../openvino/empty_config/config.pbtxt | 0 .../openvino/empty_config/expected | 45 + .../openvino/empty_config/expected.1 | 45 + .../openvino/empty_config/expected.2 | 45 + .../openvino/empty_config/expected.3 | 45 + .../openvino/no_config/expected | 45 + .../openvino/no_config/expected.1 | 45 + .../openvino/no_config/expected.2 | 45 + .../openvino/no_config/expected.3 | 45 + .../openvino/partial_config/config.pbtxt | 14 + .../partial_config}/expected | 23 +- .../partial_config}/expected.1 | 23 +- .../conflicting_scheduler_ensemble/model.py | 11 +- .../ensemble_first_step/model.py | 11 +- .../ensemble_second_step/model.py | 11 +- .../python/dynamic_batching/expected | 1 + .../python/dynamic_batching/expected.1 | 1 + .../python/dynamic_batching/expected.2 | 1 + .../python/dynamic_batching/expected.3 | 1 + .../python/dynamic_batching/model.py | 15 +- .../python/dynamic_batching_no_op/model.py | 15 +- .../python/incomplete_input/model.py | 13 +- .../model_transaction_policy/config.pbtxt | 24 + .../model_transaction_policy}/expected | 38 +- .../model_transaction_policy}/expected.1 | 34 +- .../model_transaction_policy}/expected.2 | 34 +- .../model_transaction_policy}/expected.3 | 34 +- .../python/model_transaction_policy/model.py | 46 + .../config.pbtxt | 24 + 
.../expected | 45 + .../expected.1 | 37 +- .../expected.2 | 37 +- .../expected.3 | 33 +- .../model.py | 46 + .../config.pbtxt | 28 + .../model_transaction_policy_no_op}/expected | 34 +- .../expected.1 | 38 +- .../expected.2 | 38 +- .../expected.3 | 38 +- .../model_transaction_policy_no_op/model.py | 46 + .../python/optional_input/config.pbtxt | 7 + .../optional_input/expected} | 27 +- .../python/optional_input/model.py | 48 + .../bad_input_dims/expected.3 | 44 - .../bad_output_dims/expected | 44 - .../bad_output_type/expected | 44 - .../bad_output_type/expected.1 | 44 - .../bad_output_type/expected.2 | 44 - .../empty_config/expected | 1 + .../empty_config/expected.1 | 1 + .../empty_config/expected.2 | 1 + .../empty_config/expected.3 | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../config.pbtxt | 0 .../expected | 4 +- .../expected.1 | 4 +- .../expected.2 | 4 +- .../expected.3 | 4 +- .../1/model.savedmodel/saved_model.pb | Bin 0 -> 1407 bytes .../hint_for_no_batch_2/config.pbtxt | 10 + .../hint_for_no_batch_2/expected | 47 + .../hint_for_no_batch_2/expected.1 | 47 + .../hint_for_no_batch_2/expected.2 | 47 + .../hint_for_no_batch_2/expected.3 | 47 + .../incomplete_input/expected | 1 + .../incomplete_input/expected.1 | 1 + .../incomplete_input/expected.2 | 1 + .../incomplete_input/expected.3 | 1 + .../incomplete_output/expected | 1 + .../incomplete_output/expected.1 | 1 + .../incomplete_output/expected.2 | 1 + .../incomplete_output/expected.3 | 1 + .../kind_model_config/expected | 1 + .../kind_model_config/expected.1 | 1 + .../kind_model_config/expected.2 | 1 + .../kind_model_config/expected.3 | 1 + .../max_batch_size_set/expected | 1 + .../max_batch_size_set/expected.1 | 1 + .../max_batch_size_set/expected.2 | 1 + .../max_batch_size_set/expected.3 | 1 + .../tensorflow_savedmodel/no_config/expected | 1 + .../no_config/expected.1 | 1 + .../no_config/expected.2 | 1 + .../no_config/expected.3 | 1 + .../reshape_config_provided/config.pbtxt | 0 .../reshape_config_provided/expected | 21 +- .../reshape_config_provided/expected.1 | 21 +- .../reshape_config_provided/expected.2 | 21 +- .../reshape_config_provided/expected.3 | 21 +- .../too_many_inputs/expected | 44 - .../too_many_inputs/expected.1 | 44 - .../too_many_inputs/expected.2 | 44 - .../too_many_inputs/expected.3 | 44 - .../unknown_input/expected.3 | 44 - .../unknown_output/expected | 44 - .../unknown_output/expected.1 | 44 - .../unknown_output/expected.2 | 44 - .../unknown_output/expected.3 | 44 - .../tensorrt/empty_config/expected | 1 + .../tensorrt/empty_config_variable/expected | 1 + .../tensorrt/incomplete_input/expected | 1 + .../tensorrt/incomplete_input/expected.1 | 1 + .../tensorrt/incomplete_input/expected.2 | 1 + .../tensorrt/incomplete_input/expected.3 | 1 + .../tensorrt/incomplete_output/expected | 1 + .../tensorrt/incomplete_output/expected.1 | 1 + .../tensorrt/incomplete_output/expected.2 | 1 + .../tensorrt/incomplete_output/expected.3 | 1 + .../tensorrt/multi_prof_max_bs/expected | 1 + .../tensorrt/no_config/expected | 1 + .../tensorrt/no_config_shape_tensor/expected | 1 + .../tensorrt/no_config_variable/expected | 1 + .../tensorrt/no_name_platform/expected | 1 + .../no_name_platform_variable/expected | 1 + .../tensorrt/reshape_config_provided/expected | 1 + .../cli_messages/cli_deprecation/expected | 1 + .../cli_messages/cli_override/expected | 1 + qa/L0_model_config/compare_status.py | 45 +- qa/L0_model_config/noautofill_test.py | 62 + .../noautofill_noconfig/expected | 1 + qa/L0_model_config/test.sh | 186 +- 
.../python_addsub/__init__.py | 123 + .../python_subadd/__init__.py | 123 + qa/L0_model_namespacing/test.py | 361 ++ qa/L0_model_namespacing/test.sh | 149 + .../addsub_repo/composing_model/1/model.py | 6 + .../addsub_repo/simple_addsub/config.pbtxt | 90 + .../subadd_repo/composing_model/1/model.py | 6 + .../subadd_repo/simple_subadd/config.pbtxt | 88 + .../addsub_repo/composing_model/1/model.py | 6 + .../addsub_repo/simple_addsub/config.pbtxt | 90 + .../subadd_repo/composing_model/1/model.py | 6 + .../subadd_repo/simple_subadd/config.pbtxt | 90 + .../addsub_repo/composing_addsub/1/model.py | 6 + .../addsub_repo/simple_ensemble/config.pbtxt | 90 + .../subadd_repo/composing_subadd/1/model.py | 6 + .../subadd_repo/simple_ensemble/config.pbtxt | 90 + .../addsub_repo/composing_addsub/1/model.py | 6 + .../addsub_repo/simple_addsub/config.pbtxt | 90 + .../subadd_repo/composing_subadd/1/model.py | 6 + .../subadd_repo/simple_subadd/config.pbtxt | 90 + qa/L0_model_queue/model_queue_test.py | 427 +- qa/L0_model_queue/test.sh | 62 +- qa/L0_model_update/instance_update_test.py | 649 +++ qa/L0_model_update/test.sh | 111 + qa/L0_multi_server/test.sh | 0 .../models/nan_inf_output/1/model.py | 12 +- qa/L0_nan_inf/nan_inf_test.py | 46 +- .../nullchar_string_client.py | 63 +- qa/L0_nullchar_string/test.sh | 14 +- qa/L0_onnx_optimization/test.sh | 5 +- .../ensemble_identity_2_float32/config.pbtxt | 0 .../models/identity_2_float32/config.pbtxt | 0 .../optional_connecting_tensor/config.pbtxt | 98 + .../models/optional_identity/1/model.py | 46 + .../models/optional_identity/config.pbtxt | 53 + .../pipeline_identity_2_float32/config.pbtxt | 0 qa/L0_optional_input/optional_input_test.py | 269 +- qa/L0_optional_input/test.sh | 8 +- qa/L0_output_name/output_name_test.py | 29 +- qa/L0_output_name/test.sh | 0 qa/L0_output_validation/lt_op_val_client.py | 18 +- qa/L0_output_validation/test.sh | 0 qa/L0_parallel_copy/parallel_copy_test.py | 81 +- .../model_repository/ensemble/config.pbtxt | 68 + .../model_repository/identity/config.pbtxt | 44 + .../model_repository/parameter/1/model.py | 77 + qa/L0_parameters/parameters_test.py | 223 + qa/L0_parameters/test.sh | 95 + .../config.pbtxt | 0 .../passive_instance_test.py | 17 +- qa/L0_passive_instance/test.sh | 0 .../perf_analyzer_profile_export_schema.json | 95 + qa/L0_perf_analyzer/test.sh | 260 +- qa/L0_perf_analyzer_capi/test.sh | 138 +- qa/L0_perf_analyzer_doc_links/mkdocs.yml | 36 + qa/L0_perf_analyzer_doc_links/test.sh | 73 + qa/L0_perf_analyzer_ground_truth/test.sh | 175 + qa/L0_perf_analyzer_report/test.sh | 17 +- qa/L0_perf_deeprecommender/run_test.sh | 17 +- qa/L0_perf_deeprecommender/test.sh | 4 +- qa/L0_perf_kaldi/create_data.sh | 2 +- qa/L0_perf_kaldi/test.sh | 0 qa/L0_perf_nomodel/run_test.sh | 31 +- qa/L0_perf_nomodel/test.sh | 6 +- qa/L0_perf_pyclients/simple_perf_client.py | 318 +- qa/L0_perf_pyclients/test.sh | 6 +- qa/L0_perf_resnet/run_test.sh | 25 +- qa/L0_perf_resnet/test.sh | 8 +- qa/L0_perf_tfs/test.sh | 153 - qa/L0_perf_ts/test.sh | 124 - qa/L0_perf_vllm/test.sh | 146 + qa/L0_pinned_memory/test.sh | 14 +- qa/L0_python_api/test.sh | 50 + .../test.sh | 35 +- qa/L0_query/query_e2e.py | 113 +- qa/L0_query/test.sh | 0 qa/L0_rate_limiter/rate_limiter_test.py | 143 +- qa/L0_rate_limiter/test.sh | 22 +- qa/L0_register/test.sh | 0 qa/L0_repoagent_checksum/identity_test.py | 68 +- .../grpc_cancellation_test.py | 141 + qa/L0_request_cancellation/scheduler_test.py | 233 + qa/L0_request_cancellation/test.sh | 183 + .../models/decoupled_cache/config.pbtxt | 49 + 
.../models/identity_cache/config.pbtxt | 46 + qa/L0_response_cache/test.sh | 239 +- qa/L0_sagemaker/sagemaker_multi_model_test.py | 233 +- qa/L0_sagemaker/sagemaker_test.py | 338 +- qa/L0_sagemaker/test.sh | 34 +- .../saved_model_shape_test.py | 306 +- qa/L0_savedmodel_shape/test.sh | 0 qa/L0_scalar_io/scalar_test.py | 71 + qa/L0_scalar_io/test.sh | 93 + qa/L0_sdk/grpc_test.cc | 1 + qa/L0_sdk/http_test.cc | 1 + qa/L0_sdk/test.sh | 6 +- qa/L0_secure_grpc/test.sh | 22 +- .../config.pbtxt | 62 + .../sequence_batcher_test.py | 2718 ++++++---- qa/L0_sequence_batcher/test.sh | 334 +- .../sequence_corrid_batcher_test.py | 140 +- qa/L0_sequence_corrid_batcher/test.sh | 4 +- qa/L0_sequence_stress/sequence_stress.py | 429 +- qa/L0_sequence_stress/test.sh | 4 +- qa/L0_server_status/server_status_test.py | 535 +- qa/L0_shared_memory/shared_memory_test.py | 165 +- qa/L0_shared_memory/test.sh | 2 +- qa/L0_simple_ensemble/ensemble_test.py | 75 +- qa/L0_simple_go_client/test.sh | 31 +- qa/L0_simple_nodejs_client/test.sh | 0 qa/L0_socket/test.sh | 116 +- qa/L0_storage_S3/infer_test.py | 174 - qa/L0_storage_S3/test.sh | 199 +- qa/L0_storage_S3_local/mock_s3_service.py | 113 + .../test.sh | 207 +- qa/L0_storage_azure/infer_test.py | 174 - qa/L0_storage_azure/test.sh | 135 +- qa/L0_storage_swiftstack/infer_test.py | 274 +- qa/L0_string_io/string_client_test.py | 159 +- qa/L0_tf_gpu_io/test.sh | 76 +- qa/L0_tf_gpu_io/tf_gpu_io_test.py | 105 + qa/L0_tf_parameters/test.sh | 150 + qa/L0_tf_parameters/tf_parameter_test.py | 81 + qa/L0_tf_tag_sigdef/test.sh | 17 +- qa/L0_tf_tag_sigdef/tf_tag_sigdef_test.py | 19 +- qa/L0_tf_unknown_rank/test.sh | 7 +- qa/L0_tf_unknown_rank/tf_unknown_rank_test.py | 35 +- .../tftrt_optimization_test.py | 40 +- qa/L0_trace/opentelemetry_unittest.py | 274 + qa/L0_trace/test.sh | 541 +- qa/L0_trace/trace-config.yaml | 45 + qa/L0_trace/trace_endpoint_test.py | 455 +- qa/L0_triton_repo_agent/test.sh | 0 qa/L0_trt_compat/test.sh | 110 + qa/L0_trt_compat/trt_compatibility_test.py | 50 + qa/L0_trt_data_dependent_shape/test.sh | 94 + .../trt_data_dependent_shape_test.py | 85 + qa/L0_trt_dla/dla_test.py | 27 +- qa/L0_trt_dla/test.sh | 0 qa/L0_trt_dynamic_shape/test.sh | 2 +- .../trt_dynamic_shape_test.py | 84 +- qa/L0_trt_error_propagation/test.sh | 82 + .../trt_error_propagation_test.py | 72 + qa/L0_trt_plugin/test.sh | 186 +- qa/L0_trt_plugin/trt_plugin_test.py | 99 +- qa/L0_trt_reformat_free/test.sh | 3 +- .../trt_reformat_free_test.py | 205 +- qa/L0_trt_shape_tensors/test.sh | 8 +- .../trt_shape_tensor_test.py | 688 ++- qa/L0_vertex_ai/test.sh | 4 +- qa/L0_vertex_ai/vertex_ai_test.py | 251 +- qa/L0_warmup/decoupled/1/model.py | 7 +- qa/L0_warmup/failing_infer/1/model.py | 9 +- qa/L0_warmup/test.sh | 85 +- qa/common/check_copyright.py | 189 +- qa/common/check_massif_log.py | 45 +- qa/common/check_valgrind_log.py | 42 +- qa/common/cuda_op_kernel.cu.cc.patch | 8 +- qa/common/gen_common.py | 160 + qa/common/gen_ensemble_model_utils.py | 653 ++- qa/common/gen_jetson_trt_models | 188 + qa/common/gen_qa_custom_ops | 45 +- qa/common/gen_qa_custom_ops_models.py | 271 +- .../gen_qa_dyna_sequence_implicit_models.py | 542 +- qa/common/gen_qa_dyna_sequence_models.py | 1020 ++-- qa/common/gen_qa_identity_models.py | 1179 +++-- qa/common/gen_qa_implicit_models.py | 1290 +++-- qa/common/gen_qa_model_repository | 156 +- qa/common/gen_qa_models.py | 2824 ++++++---- qa/common/gen_qa_noshape_models.py | 513 +- qa/common/gen_qa_ort_scalar_models.py | 130 + qa/common/gen_qa_pytorch_model.py | 124 + 
qa/common/gen_qa_ragged_models.py | 706 +-- qa/common/gen_qa_reshape_models.py | 1596 +++--- qa/common/gen_qa_sequence_models.py | 1000 ++-- qa/common/gen_qa_tf_parameters.py | 122 + qa/common/gen_qa_torchtrt_models.py | 34 +- qa/common/gen_qa_trt_data_dependent_shape.py | 158 + qa/common/gen_qa_trt_format_models.py | 402 +- qa/common/gen_qa_trt_plugin_models.py | 366 +- qa/common/gen_tag_sigdef.py | 255 +- qa/common/gen_xavier_trt_models | 118 - qa/common/infer_test.py | 220 + qa/common/infer_util.py | 926 ++-- .../non_aligned_validation_batched.json | 56 +- .../non_aligned_validation_no_batch.json | 56 +- .../simple_model.py | 106 +- .../validation_batched.json | 64 +- .../validation_no_batch.json | 64 +- .../wrong_validation_batched.json | 64 +- .../wrong_validation_no_batch.json | 64 +- qa/common/libtorch_infer_client.py | 45 +- qa/common/nightly_email_helper.py | 41 +- .../int_data.json | 4 +- .../int_data_diff_shape.json | 4 +- .../int_data_optional.json | 14 + .../perf_analyzer_input_data_json/output.json | 2 +- .../repeat_int32_data.json | 31 + .../string_data_with_shape.json | 8 +- .../wrong_output.json | 2 +- .../wrong_output_2.json | 2 +- qa/common/reporter.py | 203 +- qa/common/sequence_util.py | 836 +-- qa/common/shm_util.py | 330 +- qa/common/test_util.py | 201 +- qa/common/trace_summary.py | 352 +- qa/common/util.sh | 124 +- .../custom_zero_1_float32/config.pbtxt | 0 qa/openvino_models/README.md | 34 + qa/openvino_models/dynamic_batch/1/model.bin | 0 .../dynamic_batch/1/model.mapping | 195 + qa/openvino_models/dynamic_batch/1/model.xml | 166 + qa/openvino_models/fixed_batch/1/model.bin | 0 .../fixed_batch/1/model.mapping | 211 + qa/openvino_models/fixed_batch/1/model.xml | 152 + qa/python_models/add_sub/config.pbtxt | 1 - qa/python_models/add_sub/model.py | 50 +- qa/python_models/add_sub_gpu/config.pbtxt | 8 +- qa/python_models/auto_complete/model.py | 58 +- qa/python_models/auto_complete_error/model.py | 15 +- qa/python_models/bls/model.py | 712 ++- qa/python_models/bls_async/model.py | 172 +- .../bls_finalize_error/config.pbtxt | 38 + qa/python_models/bls_finalize_error/model.py | 45 + qa/python_models/bls_init_error/config.pbtxt | 38 + qa/python_models/bls_init_error/model.py | 44 + qa/python_models/bls_memory/model.py | 68 +- qa/python_models/bls_memory_async/model.py | 48 +- .../bls_model_loading/config.pbtxt | 43 + qa/python_models/bls_model_loading/model.py | 135 + qa/python_models/bls_onnx_warmup/config.pbtxt | 38 + qa/python_models/bls_onnx_warmup/model.py | 88 + qa/python_models/bls_parameters/config.pbtxt | 52 + qa/python_models/bls_parameters/model.py | 77 + .../bls_request_rescheduling/config.pbtxt | 38 + .../bls_request_rescheduling/model.py | 133 + qa/python_models/bls_simple/bls_simple.py | 84 + qa/python_models/bls_undefined/config.pbtxt | 50 + qa/python_models/bls_undefined/model.py | 33 + .../cuda_memory_consumer/1/model.py | 69 + .../cuda_memory_consumer/config.pbtxt | 28 + qa/python_models/custom_metrics/config.pbtxt | 43 + qa/python_models/custom_metrics/model.py | 278 + qa/python_models/delayed_model/model.py | 10 +- qa/python_models/dlpack_add_sub/model.py | 101 +- .../dlpack_empty_output/config.pbtxt | 43 + qa/python_models/dlpack_empty_output/model.py | 53 + qa/python_models/dlpack_identity/model.py | 8 +- qa/python_models/dlpack_io_identity/model.py | 53 +- .../dlpack_io_identity_decoupled/model.py | 43 +- qa/python_models/dlpack_square/config.pbtxt | 48 + qa/python_models/dlpack_square/model.py | 139 + qa/python_models/dlpack_sub_add/model.py | 
101 +- qa/python_models/dlpack_test/model.py | 312 +- qa/python_models/error_code/config.pbtxt | 47 + qa/python_models/error_code/model.py | 59 + qa/python_models/execute_cancel/config.pbtxt | 47 + qa/python_models/execute_cancel/model.py | 108 + qa/python_models/execute_error/model.py | 19 +- .../execute_return_error/model.py | 5 +- qa/python_models/fini_error/model.py | 3 +- qa/python_models/ground_truth/config.pbtxt | 52 + qa/python_models/ground_truth/model.py | 51 + qa/python_models/identity_fp32/model.py | 3 +- .../identity_fp32_logging/config.pbtxt | 53 + .../identity_fp32_logging/model.py | 72 + .../identity_fp32_timeout/config.pbtxt | 60 + .../identity_fp32_timeout/model.py | 45 + qa/python_models/init_args/model.py | 46 +- qa/python_models/init_error/model.py | 5 +- qa/python_models/init_exit/config.pbtxt | 46 + qa/python_models/init_exit/model.py | 40 + .../iterative_sequence/config.pbtxt | 51 + qa/python_models/iterative_sequence/model.py | 131 + qa/python_models/model_env/model.py | 9 +- qa/python_models/model_init_del/config.pbtxt | 52 + qa/python_models/model_init_del/model.py | 57 + qa/python_models/model_init_del/util.py | 189 + qa/python_models/multi_file/file1.py | 6 +- qa/python_models/multi_file/file2.py | 6 +- qa/python_models/multi_file/model.py | 11 +- qa/python_models/non_contiguous/model.py | 11 +- qa/python_models/optional/config.pbtxt | 7 - qa/python_models/optional/model.py | 14 +- .../add_sub_backend/model.py | 162 + qa/python_models/python_version/model.py | 27 +- qa/python_models/pytorch_fp32_fp32/model.py | 6 +- .../request_rescheduling_addsub/config.pbtxt | 61 + .../request_rescheduling_addsub/model.py | 82 + .../response_sender_error/model.py | 37 +- qa/python_models/sequence_int32/config.pbtxt | 80 + qa/python_models/sequence_int32/model.py | 92 + .../python_models/sequence_py/config.pbtxt | 48 +- qa/python_models/sequence_py/model.py | 93 + qa/python_models/string/model.py | 8 +- qa/python_models/string_fixed/model.py | 28 +- qa/python_models/string_identity/model.py | 14 +- qa/python_models/sub_add/model.py | 54 +- .../torchvision/resnet50/config.pbtxt | 40 + .../torchvision/resnet50/model.py | 62 + .../variable_gpu_output/config.pbtxt | 55 + qa/python_models/variable_gpu_output/model.py | 46 + qa/python_models/wrong_model/model.py | 3 +- .../wrong_return_type/config.pbtxt | 49 + qa/python_models/wrong_return_type/model.py | 67 + src/CMakeLists.txt | 188 +- src/classification.cc | 1 + src/classification.h | 1 + src/command_line_parser.cc | 2244 ++++++++ src/command_line_parser.h | 345 ++ src/common.cc | 12 +- src/common.h | 55 +- src/data_compressor.h | 13 +- src/grpc/CMakeLists.txt | 144 + src/grpc/grpc_handler.h | 46 + src/grpc/grpc_server.cc | 2552 +++++++++ src/grpc/grpc_server.h | 139 + src/grpc/grpc_utils.cc | 160 + src/grpc/grpc_utils.h | 187 + src/grpc/infer_handler.cc | 1068 ++++ src/grpc/infer_handler.h | 1436 +++++ src/grpc/stream_infer_handler.cc | 732 +++ src/grpc/stream_infer_handler.h | 124 + src/grpc_server.cc | 4621 ----------------- src/grpc_server.h | 132 - src/http_server.cc | 2723 +++++++--- src/http_server.h | 297 +- src/main.cc | 1646 +----- src/memory_alloc.cc | 2 + src/multi_server.cc | 2 + src/restricted_features.h | 114 + src/sagemaker_server.cc | 637 ++- src/sagemaker_server.h | 55 +- src/shared_memory_manager.cc | 10 +- src/shared_memory_manager.h | 1 + src/simple.cc | 44 +- src/test/CMakeLists.txt | 8 +- src/test/caffe2plan.cc | 7 +- src/test/data_compressor_test.cc | 6 +- .../src/distributed_addsub.cc | 11 +- 
src/test/dyna_sequence/src/dyna_sequence.cc | 1 + src/test/implicit_state/src/implicit_state.cc | 204 +- src/test/iterative_sequence/CMakeLists.txt | 118 + ...tonIterativeSequenceBackendConfig.cmake.in | 39 + .../src/iterative_sequence.cc | 582 +++ .../src/libtriton_iterative_sequence.ldscript | 22 +- src/test/query_backend/src/query.cc | 5 +- .../relocation_repoagent/src/relocation.cc | 8 +- src/test/sequence/src/sequence.cc | 1 + src/tracer.cc | 778 ++- src/tracer.h | 214 +- src/triton_signal.h | 1 + src/vertex_ai_server.cc | 6 +- src/vertex_ai_server.h | 2 +- 883 files changed, 76704 insertions(+), 32486 deletions(-) create mode 100644 .github/workflows/codeql.yml create mode 100644 .github/workflows/pre-commit.yaml create mode 100644 .pre-commit-config.yaml create mode 100644 CITATION.cff create mode 100644 SECURITY.md mode change 100644 => 100755 compose.py mode change 100644 => 100755 deploy/gke-marketplace-app/benchmark/perf-analyzer-script/perf_query.sh rename deploy/gke-marketplace-app/client-sample/{locustfile_bert_large.py => locustfile_bert.py} (87%) mode change 100644 => 100755 mode change 100644 => 100755 deploy/gke-marketplace-app/client-sample/perf_analyzer_grpc.sh mode change 100644 => 100755 deploy/gke-marketplace-app/server-deployer/build_and_push.sh create mode 100644 deploy/gke-marketplace-app/server-deployer/chart/triton/templates/ingress.yaml create mode 100644 deploy/gke-marketplace-app/trt-engine/README.md mode change 100755 => 100644 deploy/mlflow-triton-plugin/examples/onnx_float32_int32_int32/config.pbtxt mode change 100644 => 100755 deploy/mlflow-triton-plugin/mlflow_triton/__init__.py mode change 100644 => 100755 deploy/mlflow-triton-plugin/mlflow_triton/config.py mode change 100644 => 100755 deploy/mlflow-triton-plugin/mlflow_triton/deployments.py mode change 100644 => 100755 deploy/mlflow-triton-plugin/scripts/publish_model_to_mlflow.py mode change 100644 => 100755 deploy/mlflow-triton-plugin/scripts/triton_flavor.py mode change 100644 => 100755 deploy/mlflow-triton-plugin/setup.py mode change 100644 => 100755 docker/cpu_only/entrypoint.d/12-banner.sh mode change 100644 => 100755 docker/cpu_only/entrypoint.d/50-gpu-driver-check2.sh mode change 100644 => 100755 docker/entrypoint.d/50-gpu-driver-check2.sh mode change 100644 => 100755 docker/entrypoint.d/99-check-run-aip-mode.sh create mode 100644 docs/.gitignore create mode 100644 docs/Dockerfile.docs create mode 100644 docs/Makefile create mode 100644 docs/_static/.gitattributes create mode 100644 docs/_static/custom.css create mode 100644 docs/_static/logo_2color_horizontal.svg create mode 100644 docs/_static/logo_2color_vertical.svg create mode 100644 docs/_static/nvidia-logo-horiz-rgb-blk-for-screen.png create mode 100644 docs/_static/nvidia-logo-vert-rgb-blk-for-screen.png create mode 100644 docs/_static/rtd-data.js create mode 100644 docs/_templates/layout.html create mode 100755 docs/conf.py create mode 100644 docs/contents.md rename docs/{ => customization_guide}/build.md (90%) rename docs/{ => customization_guide}/compose.md (89%) create mode 100644 docs/customization_guide/deploy.md rename docs/{ => customization_guide}/inference_protocols.md (65%) rename docs/{ => customization_guide}/repository_agents.md (98%) rename docs/{ => customization_guide}/test.md (96%) create mode 100644 docs/examples/README.md mode change 100644 => 100755 docs/examples/jetson/concurrency_and_dynamic_batching/tao/convert_peoplenet.sh mode change 100755 => 100644 docs/examples/model_repository/simple_identity/config.pbtxt 
rename docs/{ => getting_started}/quickstart.md (84%) create mode 100644 docs/index.md delete mode 100644 docs/metrics.md delete mode 100644 docs/perf_analyzer.md create mode 100644 docs/protocol/extension_generate.md create mode 100644 docs/protocol/extension_logging.md create mode 100644 docs/protocol/extension_parameters.md delete mode 100644 docs/response_cache.md delete mode 100644 docs/trace.md rename docs/{ => user_guide}/architecture.md (96%) rename docs/{ => user_guide}/custom_operations.md (81%) create mode 100644 docs/user_guide/debugging_guide.md rename docs/{ => user_guide}/decoupled_models.md (73%) rename docs/{ => user_guide}/faq.md (68%) rename docs/{ => user_guide}/images/arch.jpg (100%) rename docs/{ => user_guide}/images/dyna_sequence_example0.png (100%) rename docs/{ => user_guide}/images/dyna_sequence_example1.png (100%) rename docs/{ => user_guide}/images/ensemble_example0.png (100%) rename docs/{ => user_guide}/images/multi_model_exec.png (100%) rename docs/{ => user_guide}/images/multi_model_parallel_exec.png (100%) rename docs/{ => user_guide}/images/multi_model_serial_exec.png (100%) rename docs/{ => user_guide}/images/sequence_example0.png (100%) rename docs/{ => user_guide}/images/sequence_example1.png (100%) rename docs/{ => user_guide}/images/sequence_example2.png (100%) rename docs/{ => user_guide}/images/triton_on_jetson.png (100%) rename docs/{ => user_guide}/jetson.md (78%) create mode 100644 docs/user_guide/metrics.md create mode 100644 docs/user_guide/model_analyzer.md rename docs/{ => user_guide}/model_configuration.md (76%) rename docs/{ => user_guide}/model_management.md (66%) rename docs/{ => user_guide}/model_repository.md (67%) rename docs/{ => user_guide}/optimization.md (85%) create mode 100644 docs/user_guide/perf_analyzer.md create mode 100644 docs/user_guide/performance_tuning.md rename docs/{ => user_guide}/ragged_batching.md (97%) rename docs/{ => user_guide}/rate_limiter.md (98%) create mode 100644 docs/user_guide/request_cancellation.md create mode 100644 docs/user_guide/response_cache.md create mode 100644 docs/user_guide/trace.md rename docs/{ => user_guide}/v1_to_v2.md (95%) create mode 100644 pyproject.toml mode change 100644 => 100755 qa/L0_async_work_queue/test.sh mode change 100644 => 100755 qa/L0_backend_config/test.sh create mode 100755 qa/L0_backend_fastertransformer/test.sh mode change 100644 => 100755 qa/L0_backend_identity/identity_test.py create mode 100755 qa/L0_backend_output_detail/test.sh mode change 100644 => 100755 qa/L0_backend_python/argument_validation/test.sh create mode 100755 qa/L0_backend_python/bls/bls_parameters_test.py mode change 100644 => 100755 qa/L0_backend_python/bls/test.sh mode change 100644 => 100755 qa/L0_backend_python/common.sh create mode 100755 qa/L0_backend_python/custom_metrics/test.sh mode change 100644 => 100755 qa/L0_backend_python/decoupled/decoupled_test.py create mode 100644 qa/L0_backend_python/decoupled/models/decoupled_bls_stream/1/model.py create mode 100644 qa/L0_backend_python/decoupled/models/decoupled_bls_stream/config.pbtxt create mode 100644 qa/L0_backend_python/decoupled/models/decoupled_raise_exception/1/model.py create mode 100644 qa/L0_backend_python/decoupled/models/decoupled_raise_exception/config.pbtxt mode change 100644 => 100755 qa/L0_backend_python/decoupled/test.sh mode change 100644 => 100755 qa/L0_backend_python/ensemble/ensemble_test.py mode change 100644 => 100755 qa/L0_backend_python/ensemble/test.sh mode change 100644 => 100755 qa/L0_backend_python/env/test.sh 
mode change 100644 => 100755 qa/L0_backend_python/examples/test.sh mode change 100644 => 100755 qa/L0_backend_python/io/io_test.py mode change 100644 => 100755 qa/L0_backend_python/io/test.sh mode change 100644 => 100755 qa/L0_backend_python/lifecycle/lifecycle_test.py mode change 100644 => 100755 qa/L0_backend_python/lifecycle/test.sh create mode 100755 qa/L0_backend_python/logging/logging_test.py create mode 100755 qa/L0_backend_python/logging/test.sh mode change 100644 => 100755 qa/L0_backend_python/model_control/model_control_test.py mode change 100644 => 100755 qa/L0_backend_python/model_control/test.sh create mode 100644 qa/L0_backend_python/python_based_backends/python_based_backends_test.py create mode 100755 qa/L0_backend_python/python_based_backends/test.sh mode change 100644 => 100755 qa/L0_backend_python/python_test.py mode change 100644 => 100755 qa/L0_backend_python/python_unittest.py create mode 100755 qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py create mode 100755 qa/L0_backend_python/request_rescheduling/test.sh mode change 100644 => 100755 qa/L0_backend_python/restart/restart_test.py mode change 100644 => 100755 qa/L0_backend_python/restart/test.sh create mode 100755 qa/L0_backend_python/setup_python_enviroment.sh mode change 100644 => 100755 qa/L0_backend_python/variants/test.sh create mode 100755 qa/L0_batch_custom/batch_custom_test.py create mode 100755 qa/L0_batch_custom/test.sh mode change 100644 => 100755 qa/L0_batch_input/batch_input_test.py mode change 100644 => 100755 qa/L0_batch_input/test.sh mode change 100644 => 100755 qa/L0_batcher/batcher_test.py mode change 100644 => 100755 qa/L0_batcher/verify_timestamps.py mode change 100644 => 100755 qa/L0_buffer_attributes/buffer_attributes_test.py mode change 100644 => 100755 qa/L0_buffer_attributes/test.sh mode change 100644 => 100755 qa/L0_client_java/test.sh mode change 100644 => 100755 qa/L0_client_memory_growth/client_memory_mail.py mode change 100644 => 100755 qa/L0_client_nobatch/client_test.py rename qa/L0_client_timeout/{client_timeout_test.py => client_infer_timeout_test.py} (61%) mode change 100644 => 100755 create mode 100755 qa/L0_client_timeout/client_non_infer_timeout_test.py mode change 100644 => 100755 qa/L0_client_timeout/test.sh create mode 100755 qa/L0_cmdline_trace/trace_client.py create mode 100644 qa/L0_config_json/max_priority_level.pbtxt mode change 100644 => 100755 qa/L0_cuda_graph/test.sh mode change 100644 => 100755 qa/L0_cuda_graph/trt_cuda_graph_test.py mode change 100644 => 100755 qa/L0_cuda_shared_memory/cuda_shared_memory_test.py mode change 100644 => 100755 qa/L0_cuda_shared_memory/test.sh mode change 100644 => 100755 qa/L0_custom_ops/cuda_op_test.py mode change 100644 => 100755 qa/L0_custom_ops/mod_op_test.py mode change 100644 => 100755 qa/L0_custom_ops/onnx_op_test.py mode change 100644 => 100755 qa/L0_custom_ops/vision_op_test.py mode change 100644 => 100755 qa/L0_custom_ops/zero_out_test.py mode change 100644 => 100755 qa/L0_data_compression/test.sh mode change 100644 => 100755 qa/L0_data_compression/validation.py mode change 100644 => 100755 qa/L0_decoupled/decoupled_test.py mode change 100644 => 100755 qa/L0_decoupled/test.sh create mode 100755 qa/L0_device_memory_tracker/test.py create mode 100755 qa/L0_device_memory_tracker/test.sh rename qa/{L0_backend_python/unittest => L0_dlpack_multi_gpu}/test.sh (79%) mode change 100644 => 100755 create mode 100644 qa/L0_doc_links/mkdocs.yml create mode 100755 qa/L0_doc_links/test.sh mode change 100644 => 100755 
qa/L0_dyna_implicit_state/test.sh mode change 100644 => 100755 qa/L0_dyna_sequence_batcher/dyna_sequence_batcher_test.py create mode 100644 qa/L0_grpc/client_plugin_models/client_plugin_test/1/model.py rename docs/model_analyzer.md => qa/L0_grpc/client_plugin_models/client_plugin_test/config.pbtxt (63%) create mode 100755 qa/L0_grpc/grpc_basic_auth_test.py create mode 100755 qa/L0_grpc/grpc_client_plugin_test.py create mode 100644 qa/L0_grpc/nginx.conf create mode 100755 qa/L0_grpc/python_grpc_aio_test.py create mode 100755 qa/L0_grpc/python_unit_test.py mode change 100644 => 100755 qa/L0_grpc/test.sh create mode 100755 qa/L0_grpc_state_cleanup/cleanup_test.py create mode 100755 qa/L0_grpc_state_cleanup/test.sh create mode 100755 qa/L0_http/generate_endpoint_test.py create mode 100644 qa/L0_http/generate_models/mock_llm/1/model.py create mode 100644 qa/L0_http/generate_models/mock_llm/config.pbtxt create mode 100755 qa/L0_http/http_basic_auth_test.py create mode 100755 qa/L0_http/http_client_plugin_test.py create mode 100755 qa/L0_http/http_restricted_api_test.py mode change 100644 => 100755 qa/L0_http/http_test.py create mode 100644 qa/L0_http/nginx.conf create mode 100755 qa/L0_http/python_http_aio_test.py mode change 100644 => 100755 qa/L0_http/test.sh mode change 100644 => 100755 qa/L0_http_fuzz/fuzztest.py mode change 100644 => 100755 qa/L0_http_fuzz/test.sh mode change 100644 => 100755 qa/L0_https/test.sh mode change 100644 => 100755 qa/L0_implicit_state/implicit_state.py create mode 100644 qa/L0_implicit_state/models/growable_memory/config.pbtxt create mode 100644 qa/L0_implicit_state/models/single_state_buffer/config.pbtxt mode change 100644 => 100755 qa/L0_implicit_state/test.sh mode change 100644 => 100755 qa/L0_infer/infer_test.py mode change 100644 => 100755 qa/L0_infer_reshape/infer_reshape_test.py mode change 100644 => 100755 qa/L0_infer_variable/infer_variable_test.py mode change 100644 => 100755 qa/L0_infer_zero/infer_zero_test.py mode change 100644 => 100755 qa/L0_inferentia_perf_analyzer/test.sh create mode 100755 qa/L0_iterative_sequence/iterative_sequence_e2e.py create mode 100644 qa/L0_iterative_sequence/models/iterative_sequence/config.pbtxt create mode 100755 qa/L0_iterative_sequence/test.sh create mode 100755 qa/L0_json/test.sh mode change 100644 => 100755 qa/L0_large_payload/large_payload_test.py mode change 100644 => 100755 qa/L0_large_payload/test.sh mode change 100644 => 100755 qa/L0_libtorch_inference_mode/test.sh create mode 100755 qa/L0_libtorch_instance_group_kind_model/client.py create mode 100755 qa/L0_libtorch_instance_group_kind_model/gen_models.py create mode 100644 qa/L0_libtorch_instance_group_kind_model/models/libtorch_multi_device/config.pbtxt create mode 100755 qa/L0_libtorch_instance_group_kind_model/test.sh mode change 100644 => 100755 qa/L0_libtorch_io_names/io_names_client.py mode change 100644 => 100755 qa/L0_libtorch_io_names/test.sh create mode 100755 qa/L0_libtorch_io_types/test.sh mode change 100644 => 100755 qa/L0_libtorch_nvfuser/test.sh mode change 100644 => 100755 qa/L0_libtorch_optimized_execution/test.sh mode change 100644 => 100755 qa/L0_libtorch_shared_weights/libtorch_shared_weights_test.py mode change 100644 => 100755 qa/L0_libtorch_shared_weights/test.sh mode change 100644 => 100755 qa/L0_lifecycle/lifecycle_test.py create mode 100755 qa/L0_logging/logging_endpoint_test.py create mode 100755 qa/L0_logging/test.sh mode change 100644 => 100755 qa/L0_long_running_stress/crashing_client.py mode change 100644 => 100755 
qa/L0_long_running_stress/scenarios.py mode change 100644 => 100755 qa/L0_long_running_stress/stress.py mode change 100644 => 100755 qa/L0_long_running_stress/stress_mail.py mode change 100644 => 100755 qa/L0_memory/test.sh mode change 100644 => 100755 qa/L0_memory_growth/busy_op_test.py mode change 100644 => 100755 qa/L0_memory_growth/server_memory_mail.py create mode 100644 qa/L0_metrics/ensemble_delay/config.pbtxt create mode 100644 qa/L0_metrics/identity_delay/config.pbtxt create mode 100755 qa/L0_metrics/metrics_config_test.py create mode 100755 qa/L0_metrics/metrics_queue_size_test.py create mode 100644 qa/L0_metrics/unit_test_models/identity_cache_off/config.pbtxt create mode 100644 qa/L0_metrics/unit_test_models/identity_cache_on/config.pbtxt mode change 100644 => 100755 qa/L0_mlflow/plugin_test.py mode change 100644 => 100755 qa/L0_mlflow/test.sh create mode 100644 qa/L0_model_config/autofill_noplatform/custom/no_delimiter/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/custom/no_delimiter/expected create mode 100644 qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/expected create mode 100644 qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/expected create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/expected create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/expected create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/expected create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/expected create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/unknown_input/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/unknown_input/expected create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/unknown_output/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/unknown_output/expected create mode 100644 qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/expected create mode 100644 qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/model.py create mode 100644 qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/expected create mode 100644 qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/model.py rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_input_dims/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_input_dims/config.pbtxt (100%) create mode 100644 
qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/expected rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_input_type/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_input_type/config.pbtxt (100%) create mode 100644 qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/expected rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_output_dims/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_output_dims/config.pbtxt (100%) create mode 100644 qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/expected rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_output_type/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_output_type/config.pbtxt (100%) create mode 100644 qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/expected rename qa/L0_model_config/{autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch => autofill_noplatform/tensorflow_savedmodel/too_many_inputs}/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/too_many_inputs/config.pbtxt (93%) create mode 100644 qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/expected rename qa/L0_model_config/{autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs => autofill_noplatform/tensorflow_savedmodel/unknown_input}/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/unknown_input/config.pbtxt (100%) create mode 100644 qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/expected rename qa/L0_model_config/{autofill_noplatform_success/tensorflow_savedmodel/unknown_input => autofill_noplatform/tensorflow_savedmodel/unknown_output}/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/unknown_output/config.pbtxt (100%) create mode 100644 qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/expected create mode 100644 qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/expected create mode 100644 qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/expected mode change 100755 => 100644 qa/L0_model_config/autofill_noplatform_success/onnx/cpu_instance/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.1 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.2 create mode 100644 
qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.3 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.1 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.2 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.3 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.1 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.2 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.3 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/config.pbtxt rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/unknown_input => openvino/partial_config}/expected (62%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/unknown_input => openvino/partial_config}/expected.1 (62%) create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/config.pbtxt rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_dims => python/model_transaction_policy}/expected (51%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_type => python/model_transaction_policy}/expected.1 (51%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_type => python/model_transaction_policy}/expected.2 (51%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_type => python/model_transaction_policy}/expected.3 (51%) create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/model.py create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_output_dims => python/model_transaction_policy_decoupled_false}/expected.1 (50%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_dims => python/model_transaction_policy_decoupled_false}/expected.2 (50%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_output_type => python/model_transaction_policy_decoupled_false}/expected.3 (50%) create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/model.py create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/config.pbtxt rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_type => python/model_transaction_policy_no_op}/expected (50%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_dims => python/model_transaction_policy_no_op}/expected.1 (50%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_output_dims => python/model_transaction_policy_no_op}/expected.2 (50%) 
rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_output_dims => python/model_transaction_policy_no_op}/expected.3 (50%) create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/model.py create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/optional_input/config.pbtxt rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/unknown_input/expected.2 => python/optional_input/expected} (52%) create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/optional_input/model.py delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.3 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.1 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.2 rename qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/{unknown_output => hint_for_no_batch_1}/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/{hint_for_no_batch => hint_for_no_batch_1}/config.pbtxt (100%) rename qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/{hint_for_no_batch => hint_for_no_batch_1}/expected (91%) rename qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/{hint_for_no_batch => hint_for_no_batch_1}/expected.1 (91%) rename qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/{hint_for_no_batch => hint_for_no_batch_1}/expected.2 (91%) rename qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/{hint_for_no_batch => hint_for_no_batch_1}/expected.3 (91%) create mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/1/model.savedmodel/saved_model.pb create mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/expected create mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/expected.1 create mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/expected.2 create mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/expected.3 mode change 100755 => 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/reshape_config_provided/config.pbtxt delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/expected delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/expected.1 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/expected.2 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/expected.3 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.3 delete mode 100644 
qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/expected delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/expected.1 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/expected.2 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/expected.3 create mode 100644 qa/L0_model_config/cli_messages/cli_deprecation/expected create mode 100644 qa/L0_model_config/cli_messages/cli_override/expected mode change 100644 => 100755 qa/L0_model_config/compare_status.py create mode 100755 qa/L0_model_config/noautofill_test.py create mode 100644 qa/L0_model_config/special_cases/noautofill_noconfig/expected create mode 100755 qa/L0_model_namespacing/python_addsub/__init__.py create mode 100755 qa/L0_model_namespacing/python_subadd/__init__.py create mode 100755 qa/L0_model_namespacing/test.py create mode 100755 qa/L0_model_namespacing/test.sh create mode 100644 qa/L0_model_namespacing/test_duplication/addsub_repo/composing_model/1/model.py create mode 100644 qa/L0_model_namespacing/test_duplication/addsub_repo/simple_addsub/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_duplication/subadd_repo/composing_model/1/model.py create mode 100644 qa/L0_model_namespacing/test_duplication/subadd_repo/simple_subadd/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/composing_model/1/model.py create mode 100644 qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/simple_addsub/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/composing_model/1/model.py create mode 100644 qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/simple_subadd/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/composing_addsub/1/model.py create mode 100644 qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/simple_ensemble/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/composing_subadd/1/model.py create mode 100644 qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/simple_ensemble/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_no_duplication/addsub_repo/composing_addsub/1/model.py create mode 100644 qa/L0_model_namespacing/test_no_duplication/addsub_repo/simple_addsub/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_no_duplication/subadd_repo/composing_subadd/1/model.py create mode 100644 qa/L0_model_namespacing/test_no_duplication/subadd_repo/simple_subadd/config.pbtxt mode change 100644 => 100755 qa/L0_model_queue/model_queue_test.py mode change 100644 => 100755 qa/L0_model_queue/test.sh create mode 100755 qa/L0_model_update/instance_update_test.py create mode 100755 qa/L0_model_update/test.sh mode change 100644 => 100755 qa/L0_multi_server/test.sh mode change 100644 => 100755 qa/L0_nan_inf/nan_inf_test.py mode change 100644 => 100755 qa/L0_nullchar_string/nullchar_string_client.py mode change 100644 => 100755 qa/L0_nullchar_string/test.sh mode change 100755 => 100644 qa/L0_optional_input/models/ensemble_identity_2_float32/config.pbtxt mode change 100755 => 100644 qa/L0_optional_input/models/identity_2_float32/config.pbtxt create mode 100644 qa/L0_optional_input/models/optional_connecting_tensor/config.pbtxt create mode 100644 qa/L0_optional_input/models/optional_identity/1/model.py create mode 100644 
qa/L0_optional_input/models/optional_identity/config.pbtxt mode change 100755 => 100644 qa/L0_optional_input/models/pipeline_identity_2_float32/config.pbtxt mode change 100644 => 100755 qa/L0_optional_input/optional_input_test.py mode change 100644 => 100755 qa/L0_output_name/output_name_test.py mode change 100644 => 100755 qa/L0_output_name/test.sh mode change 100644 => 100755 qa/L0_output_validation/lt_op_val_client.py mode change 100644 => 100755 qa/L0_output_validation/test.sh mode change 100644 => 100755 qa/L0_parallel_copy/parallel_copy_test.py create mode 100644 qa/L0_parameters/model_repository/ensemble/config.pbtxt create mode 100644 qa/L0_parameters/model_repository/identity/config.pbtxt create mode 100644 qa/L0_parameters/model_repository/parameter/1/model.py create mode 100755 qa/L0_parameters/parameters_test.py create mode 100755 qa/L0_parameters/test.sh mode change 100755 => 100644 qa/L0_passive_instance/models/distributed_int32_int32_int32/config.pbtxt mode change 100644 => 100755 qa/L0_passive_instance/passive_instance_test.py mode change 100644 => 100755 qa/L0_passive_instance/test.sh create mode 100644 qa/L0_perf_analyzer/perf_analyzer_profile_export_schema.json create mode 100644 qa/L0_perf_analyzer_doc_links/mkdocs.yml create mode 100755 qa/L0_perf_analyzer_doc_links/test.sh create mode 100755 qa/L0_perf_analyzer_ground_truth/test.sh mode change 100644 => 100755 qa/L0_perf_kaldi/create_data.sh mode change 100644 => 100755 qa/L0_perf_kaldi/test.sh mode change 100644 => 100755 qa/L0_perf_pyclients/simple_perf_client.py delete mode 100755 qa/L0_perf_tfs/test.sh delete mode 100755 qa/L0_perf_ts/test.sh create mode 100755 qa/L0_perf_vllm/test.sh create mode 100755 qa/L0_python_api/test.sh rename qa/{L0_jetson_example => L0_python_client_unit_tests}/test.sh (57%) mode change 100644 => 100755 mode change 100644 => 100755 qa/L0_query/query_e2e.py mode change 100644 => 100755 qa/L0_query/test.sh mode change 100644 => 100755 qa/L0_rate_limiter/rate_limiter_test.py mode change 100644 => 100755 qa/L0_rate_limiter/test.sh mode change 100644 => 100755 qa/L0_register/test.sh mode change 100644 => 100755 qa/L0_repoagent_checksum/identity_test.py create mode 100755 qa/L0_request_cancellation/grpc_cancellation_test.py create mode 100755 qa/L0_request_cancellation/scheduler_test.py create mode 100755 qa/L0_request_cancellation/test.sh create mode 100644 qa/L0_response_cache/models/decoupled_cache/config.pbtxt create mode 100644 qa/L0_response_cache/models/identity_cache/config.pbtxt mode change 100644 => 100755 qa/L0_sagemaker/sagemaker_multi_model_test.py mode change 100644 => 100755 qa/L0_sagemaker/sagemaker_test.py mode change 100644 => 100755 qa/L0_savedmodel_shape/saved_model_shape_test.py mode change 100644 => 100755 qa/L0_savedmodel_shape/test.sh create mode 100755 qa/L0_scalar_io/scalar_test.py create mode 100755 qa/L0_scalar_io/test.sh mode change 100644 => 100755 qa/L0_secure_grpc/test.sh create mode 100644 qa/L0_sequence_batcher/request_timeout_models/custom_sequence_int32_timeout/config.pbtxt mode change 100644 => 100755 qa/L0_sequence_batcher/sequence_batcher_test.py mode change 100644 => 100755 qa/L0_sequence_corrid_batcher/sequence_corrid_batcher_test.py mode change 100644 => 100755 qa/L0_sequence_stress/sequence_stress.py mode change 100644 => 100755 qa/L0_server_status/server_status_test.py mode change 100644 => 100755 qa/L0_shared_memory/shared_memory_test.py mode change 100644 => 100755 qa/L0_shared_memory/test.sh mode change 100644 => 100755 
qa/L0_simple_ensemble/ensemble_test.py mode change 100644 => 100755 qa/L0_simple_go_client/test.sh mode change 100644 => 100755 qa/L0_simple_nodejs_client/test.sh mode change 100644 => 100755 qa/L0_socket/test.sh delete mode 100644 qa/L0_storage_S3/infer_test.py create mode 100755 qa/L0_storage_S3_local/mock_s3_service.py rename qa/{L0_s3_local => L0_storage_S3_local}/test.sh (64%) mode change 100644 => 100755 delete mode 100644 qa/L0_storage_azure/infer_test.py mode change 100644 => 100755 qa/L0_storage_swiftstack/infer_test.py mode change 100644 => 100755 qa/L0_string_io/string_client_test.py create mode 100755 qa/L0_tf_gpu_io/tf_gpu_io_test.py create mode 100755 qa/L0_tf_parameters/test.sh create mode 100755 qa/L0_tf_parameters/tf_parameter_test.py mode change 100644 => 100755 qa/L0_tf_tag_sigdef/tf_tag_sigdef_test.py mode change 100644 => 100755 qa/L0_tf_unknown_rank/test.sh mode change 100644 => 100755 qa/L0_tf_unknown_rank/tf_unknown_rank_test.py mode change 100644 => 100755 qa/L0_tftrt_optimization/tftrt_optimization_test.py create mode 100644 qa/L0_trace/opentelemetry_unittest.py create mode 100644 qa/L0_trace/trace-config.yaml mode change 100644 => 100755 qa/L0_trace/trace_endpoint_test.py mode change 100644 => 100755 qa/L0_triton_repo_agent/test.sh create mode 100755 qa/L0_trt_compat/test.sh create mode 100755 qa/L0_trt_compat/trt_compatibility_test.py create mode 100755 qa/L0_trt_data_dependent_shape/test.sh create mode 100755 qa/L0_trt_data_dependent_shape/trt_data_dependent_shape_test.py mode change 100644 => 100755 qa/L0_trt_dla/dla_test.py mode change 100644 => 100755 qa/L0_trt_dla/test.sh mode change 100644 => 100755 qa/L0_trt_dynamic_shape/trt_dynamic_shape_test.py create mode 100755 qa/L0_trt_error_propagation/test.sh create mode 100755 qa/L0_trt_error_propagation/trt_error_propagation_test.py mode change 100644 => 100755 qa/L0_trt_plugin/test.sh mode change 100644 => 100755 qa/L0_trt_plugin/trt_plugin_test.py mode change 100644 => 100755 qa/L0_trt_reformat_free/trt_reformat_free_test.py mode change 100644 => 100755 qa/L0_trt_shape_tensors/test.sh mode change 100644 => 100755 qa/L0_trt_shape_tensors/trt_shape_tensor_test.py mode change 100644 => 100755 qa/L0_vertex_ai/test.sh mode change 100644 => 100755 qa/L0_vertex_ai/vertex_ai_test.py mode change 100644 => 100755 qa/L0_warmup/test.sh create mode 100644 qa/common/gen_common.py mode change 100644 => 100755 qa/common/gen_ensemble_model_utils.py create mode 100755 qa/common/gen_jetson_trt_models mode change 100644 => 100755 qa/common/gen_qa_custom_ops_models.py mode change 100644 => 100755 qa/common/gen_qa_dyna_sequence_implicit_models.py mode change 100644 => 100755 qa/common/gen_qa_dyna_sequence_models.py mode change 100644 => 100755 qa/common/gen_qa_identity_models.py mode change 100644 => 100755 qa/common/gen_qa_implicit_models.py mode change 100644 => 100755 qa/common/gen_qa_models.py mode change 100644 => 100755 qa/common/gen_qa_noshape_models.py create mode 100755 qa/common/gen_qa_ort_scalar_models.py create mode 100644 qa/common/gen_qa_pytorch_model.py mode change 100644 => 100755 qa/common/gen_qa_ragged_models.py mode change 100644 => 100755 qa/common/gen_qa_reshape_models.py mode change 100644 => 100755 qa/common/gen_qa_sequence_models.py create mode 100755 qa/common/gen_qa_tf_parameters.py mode change 100644 => 100755 qa/common/gen_qa_torchtrt_models.py create mode 100755 qa/common/gen_qa_trt_data_dependent_shape.py mode change 100644 => 100755 qa/common/gen_qa_trt_format_models.py mode change 100644 => 100755 
qa/common/gen_qa_trt_plugin_models.py mode change 100644 => 100755 qa/common/gen_tag_sigdef.py delete mode 100755 qa/common/gen_xavier_trt_models create mode 100755 qa/common/infer_test.py mode change 100644 => 100755 qa/common/infer_util.py mode change 100644 => 100755 qa/common/inferentia_perf_analyzer_input_data_json/simple_model.py mode change 100644 => 100755 qa/common/libtorch_infer_client.py mode change 100644 => 100755 qa/common/nightly_email_helper.py create mode 100644 qa/common/perf_analyzer_input_data_json/int_data_optional.json create mode 100644 qa/common/perf_analyzer_input_data_json/repeat_int32_data.json mode change 100644 => 100755 qa/common/sequence_util.py mode change 100644 => 100755 qa/common/shm_util.py mode change 100644 => 100755 qa/common/test_util.py mode change 100755 => 100644 qa/custom_models/custom_zero_1_float32/config.pbtxt create mode 100644 qa/openvino_models/README.md create mode 100644 qa/openvino_models/dynamic_batch/1/model.bin create mode 100644 qa/openvino_models/dynamic_batch/1/model.mapping create mode 100644 qa/openvino_models/dynamic_batch/1/model.xml create mode 100644 qa/openvino_models/fixed_batch/1/model.bin create mode 100644 qa/openvino_models/fixed_batch/1/model.mapping create mode 100644 qa/openvino_models/fixed_batch/1/model.xml create mode 100644 qa/python_models/bls_finalize_error/config.pbtxt create mode 100644 qa/python_models/bls_finalize_error/model.py create mode 100644 qa/python_models/bls_init_error/config.pbtxt create mode 100644 qa/python_models/bls_init_error/model.py create mode 100644 qa/python_models/bls_model_loading/config.pbtxt create mode 100644 qa/python_models/bls_model_loading/model.py create mode 100644 qa/python_models/bls_onnx_warmup/config.pbtxt create mode 100644 qa/python_models/bls_onnx_warmup/model.py create mode 100644 qa/python_models/bls_parameters/config.pbtxt create mode 100644 qa/python_models/bls_parameters/model.py create mode 100644 qa/python_models/bls_request_rescheduling/config.pbtxt create mode 100644 qa/python_models/bls_request_rescheduling/model.py create mode 100644 qa/python_models/bls_simple/bls_simple.py create mode 100644 qa/python_models/bls_undefined/config.pbtxt create mode 100644 qa/python_models/bls_undefined/model.py create mode 100644 qa/python_models/cuda_memory_consumer/1/model.py create mode 100644 qa/python_models/cuda_memory_consumer/config.pbtxt create mode 100644 qa/python_models/custom_metrics/config.pbtxt create mode 100644 qa/python_models/custom_metrics/model.py create mode 100644 qa/python_models/dlpack_empty_output/config.pbtxt create mode 100644 qa/python_models/dlpack_empty_output/model.py create mode 100644 qa/python_models/dlpack_square/config.pbtxt create mode 100644 qa/python_models/dlpack_square/model.py create mode 100644 qa/python_models/error_code/config.pbtxt create mode 100644 qa/python_models/error_code/model.py create mode 100644 qa/python_models/execute_cancel/config.pbtxt create mode 100644 qa/python_models/execute_cancel/model.py create mode 100644 qa/python_models/ground_truth/config.pbtxt create mode 100644 qa/python_models/ground_truth/model.py create mode 100644 qa/python_models/identity_fp32_logging/config.pbtxt create mode 100644 qa/python_models/identity_fp32_logging/model.py create mode 100644 qa/python_models/identity_fp32_timeout/config.pbtxt create mode 100644 qa/python_models/identity_fp32_timeout/model.py create mode 100644 qa/python_models/init_exit/config.pbtxt create mode 100644 qa/python_models/init_exit/model.py create mode 100644 
qa/python_models/iterative_sequence/config.pbtxt create mode 100644 qa/python_models/iterative_sequence/model.py create mode 100644 qa/python_models/model_init_del/config.pbtxt create mode 100644 qa/python_models/model_init_del/model.py create mode 100755 qa/python_models/model_init_del/util.py mode change 100644 => 100755 qa/python_models/multi_file/file1.py mode change 100644 => 100755 qa/python_models/multi_file/file2.py create mode 100644 qa/python_models/python_based_backends/add_sub_backend/model.py create mode 100644 qa/python_models/request_rescheduling_addsub/config.pbtxt create mode 100644 qa/python_models/request_rescheduling_addsub/model.py create mode 100644 qa/python_models/sequence_int32/config.pbtxt create mode 100644 qa/python_models/sequence_int32/model.py rename deploy/gke-marketplace-app/server-deployer/chart/triton/templates/istio-vs.yaml => qa/python_models/sequence_py/config.pbtxt (76%) create mode 100644 qa/python_models/sequence_py/model.py create mode 100644 qa/python_models/torchvision/resnet50/config.pbtxt create mode 100644 qa/python_models/torchvision/resnet50/model.py create mode 100644 qa/python_models/variable_gpu_output/config.pbtxt create mode 100644 qa/python_models/variable_gpu_output/model.py create mode 100644 qa/python_models/wrong_return_type/config.pbtxt create mode 100644 qa/python_models/wrong_return_type/model.py create mode 100644 src/command_line_parser.cc create mode 100644 src/command_line_parser.h create mode 100644 src/grpc/CMakeLists.txt create mode 100644 src/grpc/grpc_handler.h create mode 100644 src/grpc/grpc_server.cc create mode 100644 src/grpc/grpc_server.h create mode 100644 src/grpc/grpc_utils.cc create mode 100644 src/grpc/grpc_utils.h create mode 100644 src/grpc/infer_handler.cc create mode 100644 src/grpc/infer_handler.h create mode 100644 src/grpc/stream_infer_handler.cc create mode 100644 src/grpc/stream_infer_handler.h delete mode 100644 src/grpc_server.cc delete mode 100644 src/grpc_server.h create mode 100644 src/restricted_features.h create mode 100644 src/test/iterative_sequence/CMakeLists.txt create mode 100644 src/test/iterative_sequence/cmake/TritonIterativeSequenceBackendConfig.cmake.in create mode 100644 src/test/iterative_sequence/src/iterative_sequence.cc rename deploy/gke-marketplace-app/server-deployer/chart/triton/templates/istio-gateway.yaml => src/test/iterative_sequence/src/libtriton_iterative_sequence.ldscript (81%) diff --git a/.clang-format b/.clang-format index 98c649734c..1defc175de 100644 --- a/.clang-format +++ b/.clang-format @@ -2,6 +2,7 @@ BasedOnStyle: Google IndentWidth: 2 +ColumnLimit: 80 ContinuationIndentWidth: 4 UseTab: Never MaxEmptyLinesToKeep: 2 @@ -34,4 +35,5 @@ BinPackArguments: true BinPackParameters: true ConstructorInitializerAllOnOneLineOrOnePerLine: false -IndentCaseLabels: true \ No newline at end of file +IndentCaseLabels: true + diff --git a/.github/workflows/codeql.yml b/.github/workflows/codeql.yml new file mode 100644 index 0000000000..745a33730b --- /dev/null +++ b/.github/workflows/codeql.yml @@ -0,0 +1,84 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "CodeQL" + +on: + pull_request: + +jobs: + analyze: + name: Analyze + runs-on: ubuntu-latest + permissions: + actions: read + contents: read + security-events: write + + strategy: + fail-fast: false + matrix: + language: [ 'python' ] + # CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python', 'ruby' ] + # Learn more about CodeQL language support at https://aka.ms/codeql-docs/language-support + + steps: + - name: Checkout repository + uses: actions/checkout@v3 + + # Initializes the CodeQL tools for scanning. + - name: Initialize CodeQL + uses: github/codeql-action/init@v2 + with: + languages: ${{ matrix.language }} + # If you wish to specify custom queries, you can do so here or in a config file. + # By default, queries listed here will override any specified in a config file. + # Prefix the list here with "+" to use these queries and those in the config file. + + # For details on CodeQL's query packs, refer to: + # https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs + queries: +security-and-quality + + + # Autobuild attempts to build any compiled languages (C/C++, C#, Go, or Java). + # If this step fails, then you should remove it and run the build manually (see below) + - name: Autobuild + uses: github/codeql-action/autobuild@v2 + + # Command-line programs to run using the OS shell. + # See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun + + # If the Autobuild fails above, remove it and uncomment the following three lines. + # Modify them (or add more) to build your code; refer to the example below for guidance. + + # - run: | + # echo "Run, Build Application using script" + # ./location_of_script_within_repo/buildscript.sh + + - name: Perform CodeQL Analysis + uses: github/codeql-action/analyze@v2 + with: + category: "/language:${{matrix.language}}" diff --git a/.github/workflows/pre-commit.yaml b/.github/workflows/pre-commit.yaml new file mode 100644 index 0000000000..531cc2911b --- /dev/null +++ b/.github/workflows/pre-commit.yaml @@ -0,0 +1,39 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: pre-commit + +on: + pull_request: + +jobs: + pre-commit: + runs-on: ubuntu-22.04 + steps: + - uses: actions/checkout@v3 + - uses: actions/setup-python@v3 + - uses: pre-commit/action@v3.0.0 + diff --git a/.gitignore b/.gitignore index 523a31748f..f1b69cb25e 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,8 @@ +/build /builddir /.vscode *.so +__pycache__ +tmp +*.log +test_results.txt diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 0000000000..f44f815351 --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,74 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +repos: +- repo: https://github.com/timothycrosley/isort + rev: 5.12.0 + hooks: + - id: isort + additional_dependencies: [toml] +- repo: https://github.com/psf/black + rev: 23.1.0 + hooks: + - id: black + types_or: [python, cython] +- repo: https://github.com/PyCQA/flake8 + rev: 5.0.4 + hooks: + - id: flake8 + args: [--max-line-length=88, --select=C,E,F,W,B,B950, --extend-ignore = E203,E501] + types_or: [python, cython] +- repo: https://github.com/pre-commit/mirrors-clang-format + rev: v16.0.5 + hooks: + - id: clang-format + types_or: [c, c++, cuda, proto, textproto, java] + args: ["-fallback-style=none", "-style=file", "-i"] +- repo: https://github.com/codespell-project/codespell + rev: v2.2.4 + hooks: + - id: codespell + additional_dependencies: [tomli] + args: ["--toml", "pyproject.toml"] + exclude: (?x)^(.*stemmer.*|.*stop_words.*|^CHANGELOG.md$) +# More details about these pre-commit hooks here: +# https://pre-commit.com/hooks.html +- repo: https://github.com/pre-commit/pre-commit-hooks + rev: v4.4.0 + hooks: + - id: check-case-conflict + - id: check-executables-have-shebangs + - id: check-merge-conflict + - id: check-json + - id: check-toml + - id: check-yaml + exclude: ^deploy(\/[^\/]+)*\/templates\/.*$ + - id: check-shebang-scripts-are-executable + - id: end-of-file-fixer + types_or: [c, c++, cuda, proto, textproto, java, python] + - id: mixed-line-ending + - id: requirements-txt-fixer + - id: trailing-whitespace diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 0000000000..f8fb8d09fb --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,7 @@ +cff-version: 1.2.0 +message: "If you use this software, please cite it as below." +title: "Triton Inference Server: An Optimized Cloud and Edge Inferencing Solution." +url: https://github.com/triton-inference-server +repository-code: https://github.com/triton-inference-server/server +authors: + - name: "NVIDIA Corporation" diff --git a/CMakeLists.txt b/CMakeLists.txt index 6d4ec543df..13dc0c4e9b 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -1,4 +1,4 @@ -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -38,6 +38,7 @@ option(TRITON_ENABLE_TRACING "Include tracing support in server" OFF) option(TRITON_ENABLE_NVTX "Include NVTX support in server" OFF) option(TRITON_ENABLE_GPU "Enable GPU support in server" ON) option(TRITON_ENABLE_MALI_GPU "Enable Arm Mali GPU support in server" OFF) +option(TRITON_IGPU_BUILD "Enable options for iGPU compilation in server" OFF) set(TRITON_MIN_COMPUTE_CAPABILITY "6.0" CACHE STRING "The minimum CUDA compute capability supported by Triton" ) set(TRITON_EXTRA_LIB_PATHS "" CACHE PATH "Extra library paths for Triton Server build") @@ -54,6 +55,7 @@ option(TRITON_ENABLE_VERTEX_AI "Include Vertex AI API in server" OFF) # Metrics option(TRITON_ENABLE_METRICS "Include metrics support in server" ON) option(TRITON_ENABLE_METRICS_GPU "Include GPU metrics support in server" ON) +option(TRITON_ENABLE_METRICS_CPU "Include CPU metrics support in server" ON) # Cloud storage option(TRITON_ENABLE_GCS "Include GCS Filesystem support in server" OFF) @@ -85,6 +87,10 @@ if(TRITON_ENABLE_TRACING AND NOT TRITON_ENABLE_STATS) message(FATAL_ERROR "TRITON_ENABLE_TRACING=ON requires TRITON_ENABLE_STATS=ON") endif() +if (TRITON_ENABLE_METRICS_CPU AND NOT TRITON_ENABLE_METRICS) + message(FATAL_ERROR "TRITON_ENABLE_METRICS_CPU=ON requires TRITON_ENABLE_METRICS=ON") +endif() + if (TRITON_ENABLE_METRICS_GPU AND NOT TRITON_ENABLE_METRICS) message(FATAL_ERROR "TRITON_ENABLE_METRICS_GPU=ON requires TRITON_ENABLE_METRICS=ON") endif() @@ -113,6 +119,19 @@ FetchContent_Declare( GIT_TAG ${TRITON_THIRD_PARTY_REPO_TAG} ) +# Some libs are installed to ${TRITON_THIRD_PARTY_INSTALL_PREFIX}/{LIB}/lib64 instead +# of ${TRITON_THIRD_PARTY_INSTALL_PREFIX}/{LIB}/lib on CentOS +set (LIB_DIR "lib") +# /etc/os-release does not exist on Windows +if(EXISTS "/etc/os-release") + file(STRINGS /etc/os-release DISTRO REGEX "^NAME=") + string(REGEX REPLACE "NAME=\"(.*)\"" "\\1" DISTRO "${DISTRO}") + message(STATUS "Distro Name: ${DISTRO}") + if(DISTRO MATCHES "CentOS.*") + set (LIB_DIR "lib64") + endif() +endif() + set(TRITON_CORE_HEADERS_ONLY OFF) FetchContent_MakeAvailable(repo-third-party repo-core) @@ -152,7 +171,16 @@ endif() if (WIN32) set(_FINDPACKAGE_PROTOBUF_CONFIG_DIR "${TRITON_THIRD_PARTY_INSTALL_PREFIX}/protobuf/cmake") else() - set(_FINDPACKAGE_PROTOBUF_CONFIG_DIR "${TRITON_THIRD_PARTY_INSTALL_PREFIX}/protobuf/lib/cmake/protobuf") + set(_FINDPACKAGE_PROTOBUF_CONFIG_DIR "${TRITON_THIRD_PARTY_INSTALL_PREFIX}/protobuf/${LIB_DIR}/cmake/protobuf") +endif() + +# Triton with OpenTelemetry is not supported on Windows +# FIXME: add location for Windows, when support is added +# JIRA DLIS-4786 +if (WIN32) + set(_FINDPACKAGE_OPENTELEMETRY_CONFIG_DIR "") +else() + set(_FINDPACKAGE_OPENTELEMETRY_CONFIG_DIR "${TRITON_THIRD_PARTY_INSTALL_PREFIX}/opentelemetry-cpp/${LIB_DIR}/cmake/opentelemetry-cpp") endif() if (CMAKE_INSTALL_PREFIX_INITIALIZED_TO_DEFAULT) @@ -168,15 +196,15 @@ endif() # TRITON_ENABLE_GCS if(${TRITON_ENABLE_S3}) set(TRITON_DEPENDS ${TRITON_DEPENDS} aws-sdk-cpp) endif() # TRITON_ENABLE_S3 -if(${TRITON_ENABLE_AZURE_STORAGE}) - set(TRITON_DEPENDS ${TRITON_DEPENDS} azure-storage-cpplite) -endif() # TRITON_ENABLE_AZURE_STORAGE if(${TRITON_ENABLE_HTTP} OR ${TRITON_ENABLE_METRICS} OR ${TRITON_ENABLE_SAGEMAKER} OR ${TRITON_ENABLE_VERTEX_AI}) set(TRITON_DEPENDS ${TRITON_DEPENDS} libevent libevhtp) endif() # TRITON_ENABLE_HTTP || TRITON_ENABLE_METRICS || 
TRITON_ENABLE_SAGEMAKER || TRITON_ENABLE_VERTEX_AI if(${TRITON_ENABLE_GRPC}) set(TRITON_DEPENDS ${TRITON_DEPENDS} grpc) endif() # TRITON_ENABLE_GRPC +if(NOT WIN32 AND ${TRITON_ENABLE_TRACING}) + set(TRITON_DEPENDS ${TRITON_DEPENDS} opentelemetry-cpp) +endif() # TRITON_ENABLE_TRACING ExternalProject_Add(triton-server PREFIX triton-server @@ -189,21 +217,23 @@ ExternalProject_Add(triton-server ${_CMAKE_ARGS_VCPKG_TARGET_TRIPLET} -DGTEST_ROOT:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/googletest -DgRPC_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/grpc/lib/cmake/grpc - -Dc-ares_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/c-ares/lib/cmake/c-ares - -Dabsl_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/absl/lib/cmake/absl - -Dnlohmann_json_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/nlohmann_json/lib/cmake/nlohmann_json + -Dc-ares_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/c-ares/${LIB_DIR}/cmake/c-ares + -Dabsl_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/absl/${LIB_DIR}/cmake/absl + -DCURL_DIR:STRING=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/curl/${LIB_DIR}/cmake/CURL + -Dnlohmann_json_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/nlohmann_json/${LIB_DIR}/cmake/nlohmann_json -DLibevent_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/libevent/lib/cmake/libevent -Dlibevhtp_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/libevhtp/lib/cmake/libevhtp - -Dstorage_client_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/google-cloud-cpp/lib/cmake/storage_client - -Dazure-storage-cpplite_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/azure-storage-cpplite - -Dgoogle_cloud_cpp_common_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/google-cloud-cpp/lib/cmake/google_cloud_cpp_common - -DCrc32c_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/crc32c/lib/cmake/Crc32c - -DAWSSDK_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/lib/cmake/AWSSDK - -Daws-cpp-sdk-core_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/lib/cmake/aws-cpp-sdk-core - -Daws-cpp-sdk-s3_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/lib/cmake/aws-cpp-sdk-s3 - -Daws-c-event-stream_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/lib/aws-c-event-stream/cmake - -Daws-c-common_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/lib/aws-c-common/cmake - -Daws-checksums_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/lib/aws-checksums/cmake + -Dstorage_client_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/google-cloud-cpp/${LIB_DIR}/cmake/storage_client + -Dgoogle_cloud_cpp_common_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/google-cloud-cpp/${LIB_DIR}/cmake/google_cloud_cpp_common + -DCrc32c_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/crc32c/${LIB_DIR}/cmake/Crc32c + -DAWSSDK_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/${LIB_DIR}/cmake/AWSSDK + -Daws-cpp-sdk-core_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/${LIB_DIR}/cmake/aws-cpp-sdk-core + -Daws-cpp-sdk-s3_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/${LIB_DIR}/cmake/aws-cpp-sdk-s3 + -Daws-c-event-stream_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/${LIB_DIR}/aws-c-event-stream/cmake + -Daws-c-common_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/${LIB_DIR}/aws-c-common/cmake + -Daws-checksums_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/${LIB_DIR}/aws-checksums/cmake + -Dopentelemetry-cpp_DIR:PATH=${_FINDPACKAGE_OPENTELEMETRY_CONFIG_DIR} + -DTRITON_IGPU_BUILD:BOOL=${TRITON_IGPU_BUILD} -DTRITON_THIRD_PARTY_REPO_TAG:STRING=${TRITON_THIRD_PARTY_REPO_TAG} 
-DTRITON_COMMON_REPO_TAG:STRING=${TRITON_COMMON_REPO_TAG} -DTRITON_CORE_REPO_TAG:STRING=${TRITON_CORE_REPO_TAG} @@ -223,6 +253,7 @@ ExternalProject_Add(triton-server -DTRITON_MIN_COMPUTE_CAPABILITY:STRING=${TRITON_MIN_COMPUTE_CAPABILITY} -DTRITON_ENABLE_METRICS:BOOL=${TRITON_ENABLE_METRICS} -DTRITON_ENABLE_METRICS_GPU:BOOL=${TRITON_ENABLE_METRICS_GPU} + -DTRITON_ENABLE_METRICS_CPU:BOOL=${TRITON_ENABLE_METRICS_CPU} -DTRITON_ENABLE_GCS:BOOL=${TRITON_ENABLE_GCS} -DTRITON_ENABLE_AZURE_STORAGE:BOOL=${TRITON_ENABLE_AZURE_STORAGE} -DTRITON_ENABLE_S3:BOOL=${TRITON_ENABLE_S3} diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index dbc3f9bdb4..59e0ace975 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,5 +1,5 @@ + +# Report a Security Vulnerability + +To report a potential security vulnerability in any NVIDIA product, please use either: +* This web form: [Security Vulnerability Submission Form](https://www.nvidia.com/object/submit-security-vulnerability.html), or +* Send email to: [NVIDIA PSIRT](mailto:psirt@nvidia.com) + +**OEM Partners should contact their NVIDIA Customer Program Manager** + +If reporting a potential vulnerability via email, please encrypt it using NVIDIA’s public PGP key ([see PGP Key page](https://www.nvidia.com/en-us/security/pgp-key/)) and include the following information: +1. Product/Driver name and version/branch that contains the vulnerability +2. Type of vulnerability (code execution, denial of service, buffer overflow, etc.) +3. Instructions to reproduce the vulnerability +4. Proof-of-concept or exploit code +5. Potential impact of the vulnerability, including how an attacker could exploit the vulnerability + +See https://www.nvidia.com/en-us/security/ for past NVIDIA Security Bulletins and Notices. diff --git a/TRITON_VERSION b/TRITON_VERSION index 7609fc9e9e..25aa01454a 100644 --- a/TRITON_VERSION +++ b/TRITON_VERSION @@ -1 +1 @@ -2.24.0dev +2.42.0dev diff --git a/build.py b/build.py index 0e808b591f..9b61dd5182 100755 --- a/build.py +++ b/build.py @@ -1,5 +1,5 @@ #!/usr/bin/env python3 -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,26 +26,26 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse -import logging +import importlib.util +import multiprocessing import os import os.path -import multiprocessing import pathlib import platform -import shutil import stat import subprocess import sys -import traceback from inspect import getsourcefile +import requests + # # Build Triton Inference Server. # # By default build.py builds the Triton Docker image, but can also be # used to build without Docker. See docs/build.md and --help for more -# infomation. +# information. # # The TRITON_VERSION file indicates the Triton version and # TRITON_VERSION_MAP is used to determine the corresponding container @@ -69,41 +69,20 @@ # different versions are used then one backend or the other will # incorrectly load the other version of the openvino libraries. # -# The standalone openVINO describes multiple versions where each version -# is a pair of openVINO version and openVINO package version. When openVINO -# package version is specified, then backend will be built with pre-built -# openVINO release from Intel. 
If the package version is specified as None, -# then openVINO for the backend is built from source with openMP support. -# By default, only the first version is built. To build the all the versions -# in list use --build-multiple-openvino. Triton will use the first version -# for inference by default. In order to use different version, Triton should -# be invoked with appropriate backend configuration: -# (--backend-config=openvino,version=) -# The version string can be obtained as follows: -# _[_pre] -# Append '_pre' only if the openVINO backend was built with prebuilt openVINO -# library. In other words, when the second element of the pair is not None. -# To use ('2021.4', None) version_str should be `2021_4'. -# To use ('2021.4', '2021.4.582') version_str should be `2021_4_pre'. -# User can also build openvino backend from specific commit sha of openVINO -# repository. The pair should be (`SPECIFIC`, ). -# Note: Not all sha ids would successfuly compile and work. -# Note: When updating the conda version, make sure to update the shasum of -# the packages used for different platforms in install_miniconda function. -# TRITON_VERSION_MAP = { - '2.24.0dev': ( - '22.07dev', # triton container - '22.05', # upstream container - '1.11.1', # ORT - '2021.4.582', # ORT OpenVINO - (('2021.4', None), ('2021.4', '2021.4.582'), - ('SPECIFIC', 'f2f281e6')), # Standalone OpenVINO - '2.2.9', # DCGM version - 'py38_4.12.0') # Conda version. + "2.42.0dev": ( + "24.01dev", # triton container + "23.11", # upstream container + "1.16.3", # ORT + "2023.0.0", # ORT OpenVINO + "2023.0.0", # Standalone OpenVINO + "3.2.6", # DCGM version + "py310_23.1.0-1", # Conda version + "0.2.2", # vLLM version + ) } -CORE_BACKENDS = ['ensemble'] +CORE_BACKENDS = ["ensemble"] FLAGS = None EXTRA_CORE_CMAKE_FLAGS = {} @@ -119,7 +98,7 @@ def log(msg, force=False): try: print(msg, file=sys.stderr) except Exception: - print('', file=sys.stderr) + print("", file=sys.stderr) def log_verbose(msg): @@ -133,7 +112,7 @@ def fail(msg): def fail_if(p, msg): if p: - print('error: {}'.format(msg), file=sys.stderr) + print("error: {}".format(msg), file=sys.stderr) sys.exit(1) @@ -149,26 +128,14 @@ def target_machine(): return platform.machine().lower() -def tagged_backend(be, version): - tagged_be = be - if be == 'openvino': - if version[0] == 'SPECIFIC': - tagged_be += "_" + version[1] - else: - tagged_be += "_" + version[0].replace('.', '_') - if version[1] and target_platform() != 'windows': - tagged_be += "_pre" - return tagged_be - - def container_versions(version, container_version, upstream_container_version): if container_version is None: if version not in TRITON_VERSION_MAP: - fail('container version not known for {}'.format(version)) + fail("container version not known for {}".format(version)) container_version = TRITON_VERSION_MAP[version][0] if upstream_container_version is None: if version not in TRITON_VERSION_MAP: - fail('upstream container version not known for {}'.format(version)) + fail("upstream container version not known for {}".format(version)) upstream_container_version = TRITON_VERSION_MAP[version][1] return container_version, upstream_container_version @@ -193,13 +160,13 @@ def __del__(self): def close(self): if self._file is not None: - if target_platform() == 'windows': + if target_platform() == "windows": self.blankln() - self._file.write('}\n') - self._file.write('catch {\n') - self._file.write(' $_;\n') - self._file.write(' ExitWithCode 1;\n') - self._file.write('}\n') + self._file.write("}\n") + 
self._file.write("catch {\n") + self._file.write(" $_;\n") + self._file.write(" ExitWithCode 1;\n") + self._file.write("}\n") """Close the file""" self._file.close() self._file = None @@ -207,28 +174,28 @@ def close(self): os.chmod(self._filepath, st.st_mode | stat.S_IEXEC) def blankln(self): - self._file.write('\n') + self._file.write("\n") def commentln(self, cnt): - self._file.write('#' * cnt + '\n') + self._file.write("#" * cnt + "\n") - def comment(self, msg=''): + def comment(self, msg=""): if not isinstance(msg, str): try: for m in msg: - self._file.write(f'# {msg}\n') + self._file.write(f"# {msg}\n") return except TypeError: pass - self._file.write(f'# {msg}\n') + self._file.write(f"# {msg}\n") - def comment_verbose(self, msg=''): + def comment_verbose(self, msg=""): if self._verbose: self.comment(msg) def header(self, desc=None): - if target_platform() != 'windows': - self._file.write('#!/usr/bin/env bash\n\n') + if target_platform() != "windows": + self._file.write("#!/usr/bin/env bash\n\n") if desc is not None: self.comment() @@ -236,132 +203,134 @@ def header(self, desc=None): self.comment() self.blankln() - self.comment('Exit script immediately if any command fails') - if target_platform() == 'windows': - self._file.write('function ExitWithCode($exitcode) {\n') - self._file.write(' $host.SetShouldExit($exitcode)\n') - self._file.write(' exit $exitcode\n') - self._file.write('}\n') + self.comment("Exit script immediately if any command fails") + if target_platform() == "windows": + self._file.write("function ExitWithCode($exitcode) {\n") + self._file.write(" $host.SetShouldExit($exitcode)\n") + self._file.write(" exit $exitcode\n") + self._file.write("}\n") self.blankln() if self._verbose: - self._file.write('Set-PSDebug -Trace 1\n') + self._file.write("Set-PSDebug -Trace 1\n") self.blankln() - self._file.write('try {\n') + self._file.write("try {\n") else: - self._file.write('set -e\n') + self._file.write("set -e\n") if self._verbose: - self._file.write('set -x\n') + self._file.write("set -x\n") self.blankln() def envvar_ref(self, v): - if target_platform() == 'windows': - return f'${{env:{v}}}' - return f'${{{v}}}' + if target_platform() == "windows": + return f"${{env:{v}}}" + return f"${{{v}}}" def cmd(self, clist, check_exitcode=False): if isinstance(clist, str): - self._file.write(f'{clist}\n') + self._file.write(f"{clist}\n") else: for c in clist: - self._file.write(f'{c} ') + self._file.write(f"{c} ") self.blankln() if check_exitcode: - if target_platform() == 'windows': - self._file.write('if ($LASTEXITCODE -ne 0) {\n') + if target_platform() == "windows": + self._file.write("if ($LASTEXITCODE -ne 0) {\n") self._file.write( - ' Write-Output "exited with status code $LASTEXITCODE";\n') - self._file.write(' ExitWithCode 1;\n') - self._file.write('}\n') + ' Write-Output "exited with status code $LASTEXITCODE";\n' + ) + self._file.write(" ExitWithCode 1;\n") + self._file.write("}\n") def cwd(self, path): - if target_platform() == 'windows': - self.cmd(f'Set-Location -EV Err -EA Stop {path}') + if target_platform() == "windows": + self.cmd(f"Set-Location -EV Err -EA Stop {path}") else: - self.cmd(f'cd {path}') + self.cmd(f"cd {path}") def cp(self, src, dest): - if target_platform() == 'windows': - self.cmd(f'Copy-Item -EV Err -EA Stop {src} -Destination {dest}') + if target_platform() == "windows": + self.cmd(f"Copy-Item -EV Err -EA Stop {src} -Destination {dest}") else: - self.cmd(f'cp {src} {dest}') + self.cmd(f"cp {src} {dest}") def mkdir(self, path): - if 
target_platform() == 'windows': + if target_platform() == "windows": self.cmd( - f'New-Item -EV Err -EA Stop -ItemType Directory -Force -Path {path}' + f"New-Item -EV Err -EA Stop -ItemType Directory -Force -Path {path}" ) else: - self.cmd(f'mkdir -p {pathlib.Path(path)}') + self.cmd(f"mkdir -p {pathlib.Path(path)}") def rmdir(self, path): - if target_platform() == 'windows': - self.cmd(f'if (Test-Path -Path {path}) {{') - self.cmd(f' Remove-Item -EV Err -EA Stop -Recurse -Force {path}') - self.cmd('}') + if target_platform() == "windows": + self.cmd(f"if (Test-Path -Path {path}) {{") + self.cmd(f" Remove-Item -EV Err -EA Stop -Recurse -Force {path}") + self.cmd("}") else: - self.cmd(f'rm -fr {pathlib.Path(path)}') + self.cmd(f"rm -fr {pathlib.Path(path)}") def cpdir(self, src, dest): - if target_platform() == 'windows': - self.cmd( - f'Copy-Item -EV Err -EA Stop -Recurse {src} -Destination {dest}' - ) + if target_platform() == "windows": + self.cmd(f"Copy-Item -EV Err -EA Stop -Recurse {src} -Destination {dest}") else: - self.cmd(f'cp -r {src} {dest}') + self.cmd(f"cp -r {src} {dest}") def tar(self, subdir, tar_filename): - if target_platform() == 'windows': - fail('unsupported operation: tar') + if target_platform() == "windows": + fail("unsupported operation: tar") else: - self.cmd(f'tar zcf {tar_filename} {subdir}') + self.cmd(f"tar zcf {tar_filename} {subdir}") def cmake(self, args): # Pass some additional envvars into cmake... env_args = [] - for k in ('TRT_VERSION', 'DALI_VERSION', 'CMAKE_TOOLCHAIN_FILE', - 'VCPKG_TARGET_TRIPLET'): + for k in ("TRT_VERSION", "CMAKE_TOOLCHAIN_FILE", "VCPKG_TARGET_TRIPLET"): env_args += [f'"-D{k}={self.envvar_ref(k)}"'] - self.cmd(f'cmake {" ".join(env_args)} {" ".join(args)}', - check_exitcode=True) + self.cmd(f'cmake {" ".join(env_args)} {" ".join(args)}', check_exitcode=True) - def makeinstall(self, target='install'): - if target_platform() == 'windows': - verbose_flag = '' if self._verbose else '-clp:ErrorsOnly' + def makeinstall(self, target="install"): + if target_platform() == "windows": + verbose_flag = "" if self._verbose else "-clp:ErrorsOnly" self.cmd( - f'msbuild.exe -m:{FLAGS.build_parallel} {verbose_flag} -p:Configuration={FLAGS.build_type} {target}.vcxproj', - check_exitcode=True) + f"msbuild.exe -m:{FLAGS.build_parallel} {verbose_flag} -p:Configuration={FLAGS.build_type} {target}.vcxproj", + check_exitcode=True, + ) else: - verbose_flag = 'VERBOSE=1' if self._verbose else 'VERBOSE=0' - self.cmd(f'make -j{FLAGS.build_parallel} {verbose_flag} {target}') + verbose_flag = "VERBOSE=1" if self._verbose else "VERBOSE=0" + self.cmd(f"make -j{FLAGS.build_parallel} {verbose_flag} {target}") def gitclone(self, repo, tag, subdir, org): clone_dir = subdir if not FLAGS.no_force_clone: self.rmdir(clone_dir) - if target_platform() == 'windows': - self.cmd(f'if (-Not (Test-Path -Path {clone_dir})) {{') + if target_platform() == "windows": + self.cmd(f"if (-Not (Test-Path -Path {clone_dir})) {{") else: - self.cmd(f'if [[ ! -e {clone_dir} ]]; then') + self.cmd(f"if [[ ! -e {clone_dir} ]]; then") + # FIXME [DLIS-4045 - Currently the tag starting with "pull/" is not + # working with "--repo-tag" as the option is not forwarded to the + # individual repo build correctly.] # If 'tag' starts with "pull/" then it must be of form # "pull//head". We just clone at "main" and then fetch the # reference onto a new branch we name "tritonbuildref". 
if tag.startswith("pull/"): self.cmd( - f' git clone --recursive --depth=1 {org}/{repo}.git {subdir};', - check_exitcode=True) - self.cmd('}' if target_platform() == 'windows' else 'fi') + f" git clone --recursive --depth=1 {org}/{repo}.git {subdir};", + check_exitcode=True, + ) + self.cmd("}" if target_platform() == "windows" else "fi") self.cwd(subdir) - self.cmd(f'git fetch origin {tag}:tritonbuildref', - check_exitcode=True) - self.cmd(f'git checkout tritonbuildref', check_exitcode=True) + self.cmd(f"git fetch origin {tag}:tritonbuildref", check_exitcode=True) + self.cmd(f"git checkout tritonbuildref", check_exitcode=True) else: self.cmd( - f' git clone --recursive --single-branch --depth=1 -b {tag} {org}/{repo}.git {subdir};', - check_exitcode=True) - self.cmd('}' if target_platform() == 'windows' else 'fi') + f" git clone --recursive --single-branch --depth=1 -b {tag} {org}/{repo}.git {subdir};", + check_exitcode=True, + ) + self.cmd("}" if target_platform() == "windows" else "fi") def cmake_core_arg(name, type, value): @@ -370,9 +339,9 @@ def cmake_core_arg(name, type, value): if name in OVERRIDE_CORE_CMAKE_FLAGS: value = OVERRIDE_CORE_CMAKE_FLAGS[name] if type is None: - type = '' + type = "" else: - type = ':{}'.format(type) + type = ":{}".format(type) return '"-D{}{}={}"'.format(name, type, value) @@ -383,7 +352,7 @@ def cmake_core_enable(name, flag): if name in OVERRIDE_CORE_CMAKE_FLAGS: value = OVERRIDE_CORE_CMAKE_FLAGS[name] else: - value = 'ON' if flag else 'OFF' + value = "ON" if flag else "OFF" return '"-D{}:BOOL={}"'.format(name, value) @@ -401,9 +370,9 @@ def cmake_backend_arg(backend, name, type, value): if name in OVERRIDE_BACKEND_CMAKE_FLAGS[backend]: value = OVERRIDE_BACKEND_CMAKE_FLAGS[backend][name] if type is None: - type = '' + type = "" else: - type = ':{}'.format(type) + type = ":{}".format(type) return '"-D{}{}={}"'.format(name, type, value) @@ -416,7 +385,7 @@ def cmake_backend_enable(backend, name, flag): if name in OVERRIDE_BACKEND_CMAKE_FLAGS[backend]: value = OVERRIDE_BACKEND_CMAKE_FLAGS[backend][name] if value is None: - value = 'ON' if flag else 'OFF' + value = "ON" if flag else "OFF" return '"-D{}:BOOL={}"'.format(name, value) @@ -431,15 +400,15 @@ def cmake_backend_extra_args(backend): def cmake_repoagent_arg(name, type, value): # For now there is no override for repo-agents if type is None: - type = '' + type = "" else: - type = ':{}'.format(type) + type = ":{}".format(type) return '"-D{}{}={}"'.format(name, type, value) def cmake_repoagent_enable(name, flag): # For now there is no override for repo-agents - value = 'ON' if flag else 'OFF' + value = "ON" if flag else "OFF" return '"-D{}:BOOL={}"'.format(name, value) @@ -449,63 +418,80 @@ def cmake_repoagent_extra_args(): return args +def cmake_cache_arg(name, type, value): + # For now there is no override for caches + if type is None: + type = "" + else: + type = ":{}".format(type) + return '"-D{}{}={}"'.format(name, type, value) + + +def cmake_cache_enable(name, flag): + # For now there is no override for caches + value = "ON" if flag else "OFF" + return '"-D{}:BOOL={}"'.format(name, value) + + +def cmake_cache_extra_args(): + # For now there is no extra args for caches + args = [] + return args + + def core_cmake_args(components, backends, cmake_dir, install_dir): cargs = [ - cmake_core_arg('CMAKE_BUILD_TYPE', None, FLAGS.build_type), - cmake_core_arg('CMAKE_INSTALL_PREFIX', 'PATH', install_dir), - cmake_core_arg('TRITON_VERSION', 'STRING', FLAGS.version), - cmake_core_arg('TRITON_COMMON_REPO_TAG', 
'STRING', - components['common']), - cmake_core_arg('TRITON_CORE_REPO_TAG', 'STRING', components['core']), - cmake_core_arg('TRITON_BACKEND_REPO_TAG', 'STRING', - components['backend']), - cmake_core_arg('TRITON_THIRD_PARTY_REPO_TAG', 'STRING', - components['thirdparty']) + cmake_core_arg("CMAKE_BUILD_TYPE", None, FLAGS.build_type), + cmake_core_arg("CMAKE_INSTALL_PREFIX", "PATH", install_dir), + cmake_core_arg("TRITON_VERSION", "STRING", FLAGS.version), + cmake_core_arg("TRITON_COMMON_REPO_TAG", "STRING", components["common"]), + cmake_core_arg("TRITON_CORE_REPO_TAG", "STRING", components["core"]), + cmake_core_arg("TRITON_BACKEND_REPO_TAG", "STRING", components["backend"]), + cmake_core_arg( + "TRITON_THIRD_PARTY_REPO_TAG", "STRING", components["thirdparty"] + ), ] + cargs.append(cmake_core_enable("TRITON_ENABLE_LOGGING", FLAGS.enable_logging)) + cargs.append(cmake_core_enable("TRITON_ENABLE_STATS", FLAGS.enable_stats)) + cargs.append(cmake_core_enable("TRITON_ENABLE_METRICS", FLAGS.enable_metrics)) cargs.append( - cmake_core_enable('TRITON_ENABLE_LOGGING', FLAGS.enable_logging)) - cargs.append(cmake_core_enable('TRITON_ENABLE_STATS', FLAGS.enable_stats)) - cargs.append( - cmake_core_enable('TRITON_ENABLE_METRICS', FLAGS.enable_metrics)) - cargs.append( - cmake_core_enable('TRITON_ENABLE_METRICS_GPU', - FLAGS.enable_gpu_metrics)) + cmake_core_enable("TRITON_ENABLE_METRICS_GPU", FLAGS.enable_gpu_metrics) + ) cargs.append( - cmake_core_enable('TRITON_ENABLE_TRACING', FLAGS.enable_tracing)) - cargs.append(cmake_core_enable('TRITON_ENABLE_NVTX', FLAGS.enable_nvtx)) + cmake_core_enable("TRITON_ENABLE_METRICS_CPU", FLAGS.enable_cpu_metrics) + ) + cargs.append(cmake_core_enable("TRITON_ENABLE_TRACING", FLAGS.enable_tracing)) + cargs.append(cmake_core_enable("TRITON_ENABLE_NVTX", FLAGS.enable_nvtx)) - cargs.append(cmake_core_enable('TRITON_ENABLE_GPU', FLAGS.enable_gpu)) + cargs.append(cmake_core_enable("TRITON_ENABLE_GPU", FLAGS.enable_gpu)) cargs.append( - cmake_core_arg('TRITON_MIN_COMPUTE_CAPABILITY', None, - FLAGS.min_compute_capability)) + cmake_core_arg( + "TRITON_MIN_COMPUTE_CAPABILITY", None, FLAGS.min_compute_capability + ) + ) - cargs.append( - cmake_core_enable('TRITON_ENABLE_MALI_GPU', FLAGS.enable_mali_gpu)) + cargs.append(cmake_core_enable("TRITON_ENABLE_MALI_GPU", FLAGS.enable_mali_gpu)) + cargs.append(cmake_core_enable("TRITON_ENABLE_GRPC", "grpc" in FLAGS.endpoint)) + cargs.append(cmake_core_enable("TRITON_ENABLE_HTTP", "http" in FLAGS.endpoint)) cargs.append( - cmake_core_enable('TRITON_ENABLE_GRPC', 'grpc' in FLAGS.endpoint)) - cargs.append( - cmake_core_enable('TRITON_ENABLE_HTTP', 'http' in FLAGS.endpoint)) - cargs.append( - cmake_core_enable('TRITON_ENABLE_SAGEMAKER', 'sagemaker' - in FLAGS.endpoint)) + cmake_core_enable("TRITON_ENABLE_SAGEMAKER", "sagemaker" in FLAGS.endpoint) + ) cargs.append( - cmake_core_enable('TRITON_ENABLE_VERTEX_AI', 'vertex-ai' - in FLAGS.endpoint)) + cmake_core_enable("TRITON_ENABLE_VERTEX_AI", "vertex-ai" in FLAGS.endpoint) + ) + cargs.append(cmake_core_enable("TRITON_ENABLE_GCS", "gcs" in FLAGS.filesystem)) + cargs.append(cmake_core_enable("TRITON_ENABLE_S3", "s3" in FLAGS.filesystem)) cargs.append( - cmake_core_enable('TRITON_ENABLE_GCS', 'gcs' in FLAGS.filesystem)) - cargs.append(cmake_core_enable('TRITON_ENABLE_S3', 's3' - in FLAGS.filesystem)) - cargs.append( - cmake_core_enable('TRITON_ENABLE_AZURE_STORAGE', 'azure_storage' - in FLAGS.filesystem)) + cmake_core_enable( + "TRITON_ENABLE_AZURE_STORAGE", "azure_storage" in FLAGS.filesystem + ) 
+ ) - cargs.append( - cmake_core_enable('TRITON_ENABLE_ENSEMBLE', 'ensemble' in backends)) - cargs.append( - cmake_core_enable('TRITON_ENABLE_TENSORRT', 'tensorrt' in backends)) + cargs.append(cmake_core_enable("TRITON_ENABLE_ENSEMBLE", "ensemble" in backends)) + cargs.append(cmake_core_enable("TRITON_ENABLE_TENSORRT", "tensorrt" in backends)) cargs += cmake_core_extra_args() cargs.append(cmake_dir) @@ -513,346 +499,391 @@ def core_cmake_args(components, backends, cmake_dir, install_dir): def repoagent_repo(ra): - return '{}_repository_agent'.format(ra) + return "{}_repository_agent".format(ra) def repoagent_cmake_args(images, components, ra, install_dir): args = [] cargs = args + [ - cmake_repoagent_arg('CMAKE_BUILD_TYPE', None, FLAGS.build_type), - cmake_repoagent_arg('CMAKE_INSTALL_PREFIX', 'PATH', install_dir), - cmake_repoagent_arg('TRITON_COMMON_REPO_TAG', 'STRING', - components['common']), - cmake_repoagent_arg('TRITON_CORE_REPO_TAG', 'STRING', - components['core']) + cmake_repoagent_arg("CMAKE_BUILD_TYPE", None, FLAGS.build_type), + cmake_repoagent_arg("CMAKE_INSTALL_PREFIX", "PATH", install_dir), + cmake_repoagent_arg("TRITON_COMMON_REPO_TAG", "STRING", components["common"]), + cmake_repoagent_arg("TRITON_CORE_REPO_TAG", "STRING", components["core"]), ] - cargs.append(cmake_repoagent_enable('TRITON_ENABLE_GPU', FLAGS.enable_gpu)) + cargs.append(cmake_repoagent_enable("TRITON_ENABLE_GPU", FLAGS.enable_gpu)) cargs += cmake_repoagent_extra_args() - cargs.append('..') + cargs.append("..") + return cargs + + +def cache_repo(cache): + # example: "local", or "redis" + return "{}_cache".format(cache) + + +def cache_cmake_args(images, components, cache, install_dir): + args = [] + + cargs = args + [ + cmake_cache_arg("CMAKE_BUILD_TYPE", None, FLAGS.build_type), + cmake_cache_arg("CMAKE_INSTALL_PREFIX", "PATH", install_dir), + cmake_cache_arg("TRITON_COMMON_REPO_TAG", "STRING", components["common"]), + cmake_cache_arg("TRITON_CORE_REPO_TAG", "STRING", components["core"]), + ] + + cargs.append(cmake_cache_enable("TRITON_ENABLE_GPU", FLAGS.enable_gpu)) + cargs += cmake_cache_extra_args() + cargs.append("..") return cargs def backend_repo(be): - if (be == 'tensorflow1') or (be == 'tensorflow2'): - return 'tensorflow_backend' - if be.startswith("openvino"): - return 'openvino_backend' - return '{}_backend'.format(be) + return "{}_backend".format(be) + +def backend_cmake_args(images, components, be, install_dir, library_paths): + cmake_build_type = FLAGS.build_type -def backend_cmake_args(images, components, be, install_dir, library_paths, - variant_index): - if be == 'onnxruntime': + if be == "onnxruntime": args = onnxruntime_cmake_args(images, library_paths) - elif be.startswith('openvino'): - args = openvino_cmake_args(be, variant_index) - elif be == 'tensorflow1': - args = tensorflow_cmake_args(1, images, library_paths) - elif be == 'tensorflow2': - args = tensorflow_cmake_args(2, images, library_paths) - elif be == 'python': + elif be == "openvino": + args = openvino_cmake_args() + elif be == "tensorflow": + args = tensorflow_cmake_args(images, library_paths) + elif be == "python": args = [] - elif be == 'dali': + elif be == "dali": args = dali_cmake_args() - elif be == 'pytorch': + elif be == "pytorch": args = pytorch_cmake_args(images) - elif be == 'armnn_tflite': + elif be == "armnn_tflite": args = armnn_tflite_cmake_args() - elif be == 'fil': + elif be == "fil": args = fil_cmake_args(images) - elif be == 'fastertransformer': - args = [] - elif be == 'tensorrt': + # DLIS-4618: FIL 
backend fails debug build, so override it for now. + cmake_build_type = "Release" + elif be == "fastertransformer": + args = fastertransformer_cmake_args() + elif be == "tensorrt": args = tensorrt_cmake_args() + elif be == "tensorrtllm": + args = tensorrtllm_cmake_args(images) else: args = [] cargs = args + [ - cmake_backend_arg(be, 'CMAKE_BUILD_TYPE', None, FLAGS.build_type), - cmake_backend_arg(be, 'CMAKE_INSTALL_PREFIX', 'PATH', install_dir), - cmake_backend_arg(be, 'TRITON_COMMON_REPO_TAG', 'STRING', - components['common']), - cmake_backend_arg(be, 'TRITON_CORE_REPO_TAG', 'STRING', - components['core']), - cmake_backend_arg(be, 'TRITON_BACKEND_REPO_TAG', 'STRING', - components['backend']) + cmake_backend_arg(be, "CMAKE_BUILD_TYPE", None, cmake_build_type), + cmake_backend_arg(be, "CMAKE_INSTALL_PREFIX", "PATH", install_dir), + cmake_backend_arg(be, "TRITON_COMMON_REPO_TAG", "STRING", components["common"]), + cmake_backend_arg(be, "TRITON_CORE_REPO_TAG", "STRING", components["core"]), + cmake_backend_arg( + be, "TRITON_BACKEND_REPO_TAG", "STRING", components["backend"] + ), ] - cargs.append(cmake_backend_enable(be, 'TRITON_ENABLE_GPU', - FLAGS.enable_gpu)) + cargs.append(cmake_backend_enable(be, "TRITON_ENABLE_GPU", FLAGS.enable_gpu)) cargs.append( - cmake_backend_enable(be, 'TRITON_ENABLE_MALI_GPU', - FLAGS.enable_mali_gpu)) - cargs.append( - cmake_backend_enable(be, 'TRITON_ENABLE_STATS', FLAGS.enable_stats)) + cmake_backend_enable(be, "TRITON_ENABLE_MALI_GPU", FLAGS.enable_mali_gpu) + ) + cargs.append(cmake_backend_enable(be, "TRITON_ENABLE_STATS", FLAGS.enable_stats)) cargs.append( - cmake_backend_enable(be, 'TRITON_ENABLE_METRICS', FLAGS.enable_metrics)) + cmake_backend_enable(be, "TRITON_ENABLE_METRICS", FLAGS.enable_metrics) + ) + + # [DLIS-4950] always enable below once Windows image is updated with CUPTI + # cargs.append(cmake_backend_enable(be, 'TRITON_ENABLE_MEMORY_TRACKER', True)) + if (target_platform() == "windows") and (not FLAGS.no_container_build): + print( + "Warning: Detected docker build is used for Windows, backend utility 'device memory tracker' will be disabled due to missing library in CUDA Windows docker image." + ) + cargs.append(cmake_backend_enable(be, "TRITON_ENABLE_MEMORY_TRACKER", False)) + elif target_platform() == "igpu": + print( + "Warning: Detected iGPU build, backend utility 'device memory tracker' will be disabled as iGPU doesn't contain required version of the library." 
+ ) + cargs.append(cmake_backend_enable(be, "TRITON_ENABLE_MEMORY_TRACKER", False)) + elif FLAGS.enable_gpu: + cargs.append(cmake_backend_enable(be, "TRITON_ENABLE_MEMORY_TRACKER", True)) cargs += cmake_backend_extra_args(be) - cargs.append('..') + cargs.append("..") return cargs def pytorch_cmake_args(images): - - # If platform is jetpack do not use docker based build - if target_platform() == 'jetpack': - if 'pytorch' not in library_paths: - raise Exception( - "Must specify library path for pytorch using --library-paths=pytorch:" - ) - pt_lib_path = library_paths['pytorch'] + "/lib" - pt_include_paths = "" - for suffix in [ - 'include/torch', 'include/torch/torch/csrc/api/include', - 'include/torchvision' - ]: - pt_include_paths += library_paths['pytorch'] + '/' + suffix + ';' - cargs = [ - cmake_backend_arg('pytorch', 'TRITON_PYTORCH_INCLUDE_PATHS', None, - pt_include_paths), - cmake_backend_arg('pytorch', 'TRITON_PYTORCH_LIB_PATHS', None, - pt_lib_path), - ] + if "pytorch" in images: + image = images["pytorch"] else: - if "pytorch" in images: - image = images["pytorch"] - else: - image = 'nvcr.io/nvidia/pytorch:{}-py3'.format( - FLAGS.upstream_container_version) - cargs = [ - cmake_backend_arg('pytorch', 'TRITON_PYTORCH_DOCKER_IMAGE', None, - image), - ] + image = "nvcr.io/nvidia/pytorch:{}-py3".format(FLAGS.upstream_container_version) + cargs = [ + cmake_backend_arg("pytorch", "TRITON_PYTORCH_DOCKER_IMAGE", None, image), + ] - if FLAGS.enable_gpu: - cargs.append( - cmake_backend_enable('pytorch', - 'TRITON_PYTORCH_ENABLE_TORCHTRT', True)) + if FLAGS.enable_gpu: cargs.append( - cmake_backend_enable('pytorch', 'TRITON_ENABLE_NVTX', - FLAGS.enable_nvtx)) + cmake_backend_enable("pytorch", "TRITON_PYTORCH_ENABLE_TORCHTRT", True) + ) + cargs.append( + cmake_backend_enable("pytorch", "TRITON_ENABLE_NVTX", FLAGS.enable_nvtx) + ) return cargs def onnxruntime_cmake_args(images, library_paths): cargs = [ - cmake_backend_arg('onnxruntime', 'TRITON_BUILD_ONNXRUNTIME_VERSION', - None, TRITON_VERSION_MAP[FLAGS.version][2]) + cmake_backend_arg( + "onnxruntime", + "TRITON_BUILD_ONNXRUNTIME_VERSION", + None, + TRITON_VERSION_MAP[FLAGS.version][2], + ) ] # TRITON_ENABLE_GPU is already set for all backends in backend_cmake_args() if FLAGS.enable_gpu: cargs.append( - cmake_backend_enable('onnxruntime', - 'TRITON_ENABLE_ONNXRUNTIME_TENSORRT', True)) - - # If platform is jetpack do not use docker based build - if target_platform() == 'jetpack': - if 'onnxruntime' not in library_paths: - raise Exception( - "Must specify library path for onnxruntime using --library-paths=onnxruntime:" + cmake_backend_enable( + "onnxruntime", "TRITON_ENABLE_ONNXRUNTIME_TENSORRT", True + ) + ) + + if target_platform() == "windows": + if "base" in images: + cargs.append( + cmake_backend_arg( + "onnxruntime", "TRITON_BUILD_CONTAINER", None, images["base"] + ) ) - ort_lib_path = library_paths['onnxruntime'] + "/lib" - ort_include_path = library_paths['onnxruntime'] + "/include" - cargs += [ - cmake_backend_arg('onnxruntime', 'TRITON_ONNXRUNTIME_INCLUDE_PATHS', - None, ort_include_path), - cmake_backend_arg('onnxruntime', 'TRITON_ONNXRUNTIME_LIB_PATHS', - None, ort_lib_path), - cmake_backend_enable('onnxruntime', - 'TRITON_ENABLE_ONNXRUNTIME_OPENVINO', False) - ] else: - if target_platform() == 'windows': - if 'base' in images: - cargs.append( - cmake_backend_arg('onnxruntime', 'TRITON_BUILD_CONTAINER', - None, images['base'])) + if "base" in images: + cargs.append( + cmake_backend_arg( + "onnxruntime", "TRITON_BUILD_CONTAINER", 
None, images["base"] + ) + ) else: - if 'base' in images: - cargs.append( - cmake_backend_arg('onnxruntime', 'TRITON_BUILD_CONTAINER', - None, images['base'])) - else: - cargs.append( - cmake_backend_arg('onnxruntime', - 'TRITON_BUILD_CONTAINER_VERSION', None, - TRITON_VERSION_MAP[FLAGS.version][1])) - - if ((target_machine() != 'aarch64') and - (TRITON_VERSION_MAP[FLAGS.version][3] is not None)): - cargs.append( - cmake_backend_enable('onnxruntime', - 'TRITON_ENABLE_ONNXRUNTIME_OPENVINO', - True)) - cargs.append( - cmake_backend_arg( - 'onnxruntime', - 'TRITON_BUILD_ONNXRUNTIME_OPENVINO_VERSION', None, - TRITON_VERSION_MAP[FLAGS.version][3])) + cargs.append( + cmake_backend_arg( + "onnxruntime", + "TRITON_BUILD_CONTAINER_VERSION", + None, + TRITON_VERSION_MAP[FLAGS.version][1], + ) + ) - return cargs + if (target_machine() != "aarch64") and ( + TRITON_VERSION_MAP[FLAGS.version][3] is not None + ): + cargs.append( + cmake_backend_enable( + "onnxruntime", "TRITON_ENABLE_ONNXRUNTIME_OPENVINO", True + ) + ) + cargs.append( + cmake_backend_arg( + "onnxruntime", + "TRITON_BUILD_ONNXRUNTIME_OPENVINO_VERSION", + None, + TRITON_VERSION_MAP[FLAGS.version][3], + ) + ) + if target_platform() == "igpu": + cargs.append( + cmake_backend_arg( + "onnxruntime", + "TRITON_BUILD_TARGET_PLATFORM", + None, + target_platform(), + ) + ) -def openvino_cmake_args(be, variant_index): - using_specific_commit_sha = False - if TRITON_VERSION_MAP[FLAGS.version][4][variant_index][0] == 'SPECIFIC': - using_specific_commit_sha = True + return cargs - ov_version = TRITON_VERSION_MAP[FLAGS.version][4][variant_index][1] - if ov_version: - if using_specific_commit_sha: - use_prebuilt_ov = False - else: - use_prebuilt_ov = True - else: - # If the OV package version is None, then we are not using prebuilt package - ov_version = TRITON_VERSION_MAP[FLAGS.version][4][variant_index][0] - use_prebuilt_ov = False - if using_specific_commit_sha: - cargs = [ - cmake_backend_arg(be, 'TRITON_BUILD_OPENVINO_COMMIT_SHA', None, - ov_version), - ] - else: - cargs = [ - cmake_backend_arg(be, 'TRITON_BUILD_OPENVINO_VERSION', None, - ov_version), - ] - cargs.append( - cmake_backend_arg(be, 'TRITON_OPENVINO_BACKEND_INSTALLDIR', None, be)) - if target_platform() == 'windows': - if 'base' in images: + +def openvino_cmake_args(): + cargs = [ + cmake_backend_arg( + "openvino", + "TRITON_BUILD_OPENVINO_VERSION", + None, + TRITON_VERSION_MAP[FLAGS.version][4], + ) + ] + if target_platform() == "windows": + if "base" in images: cargs.append( - cmake_backend_arg(be, 'TRITON_BUILD_CONTAINER', None, - images['base'])) + cmake_backend_arg( + "openvino", "TRITON_BUILD_CONTAINER", None, images["base"] + ) + ) else: - if 'base' in images: + if "base" in images: cargs.append( - cmake_backend_arg(be, 'TRITON_BUILD_CONTAINER', None, - images['base'])) + cmake_backend_arg( + "openvino", "TRITON_BUILD_CONTAINER", None, images["base"] + ) + ) else: cargs.append( - cmake_backend_arg(be, 'TRITON_BUILD_CONTAINER_VERSION', None, - TRITON_VERSION_MAP[FLAGS.version][1])) - cargs.append( - cmake_backend_enable(be, 'TRITON_BUILD_USE_PREBUILT_OPENVINO', - use_prebuilt_ov)) + cmake_backend_arg( + "openvino", + "TRITON_BUILD_CONTAINER_VERSION", + None, + TRITON_VERSION_MAP[FLAGS.version][1], + ) + ) return cargs def tensorrt_cmake_args(): cargs = [ - cmake_backend_enable('tensorrt', 'TRITON_ENABLE_NVTX', - FLAGS.enable_nvtx), + cmake_backend_enable("tensorrt", "TRITON_ENABLE_NVTX", FLAGS.enable_nvtx), ] - if target_platform() == 'windows': + if target_platform() == "windows": 
cargs.append( - cmake_backend_arg('tensorrt', 'TRITON_TENSORRT_INCLUDE_PATHS', None, - 'c:/TensorRT/include')) + cmake_backend_arg( + "tensorrt", "TRITON_TENSORRT_INCLUDE_PATHS", None, "c:/TensorRT/include" + ) + ) return cargs -def tensorflow_cmake_args(ver, images, library_paths): - backend_name = "tensorflow{}".format(ver) - - # If platform is jetpack do not use docker images +def tensorflow_cmake_args(images, library_paths): + backend_name = "tensorflow" extra_args = [] - if target_platform() == 'jetpack': - if backend_name in library_paths: - extra_args = [ - cmake_backend_arg(backend_name, 'TRITON_TENSORFLOW_LIB_PATHS', - None, library_paths[backend_name]) - ] - else: - raise Exception( - f"Must specify library path for {backend_name} using --library-paths={backend_name}:" - ) + + # If a specific TF image is specified use it, otherwise pull from NGC. + if backend_name in images: + image = images[backend_name] else: - # If a specific TF image is specified use it, otherwise pull from NGC. - if backend_name in images: - image = images[backend_name] - else: - image = 'nvcr.io/nvidia/tensorflow:{}-tf{}-py3'.format( - FLAGS.upstream_container_version, ver) - extra_args = [ - cmake_backend_arg(backend_name, 'TRITON_TENSORFLOW_DOCKER_IMAGE', - None, image) - ] - return [ - cmake_backend_arg(backend_name, 'TRITON_TENSORFLOW_VERSION', None, ver) - ] + extra_args + image = "nvcr.io/nvidia/tensorflow:{}-tf2-py3".format( + FLAGS.upstream_container_version + ) + extra_args = [ + cmake_backend_arg(backend_name, "TRITON_TENSORFLOW_DOCKER_IMAGE", None, image) + ] + return extra_args def dali_cmake_args(): return [ - cmake_backend_enable('dali', 'TRITON_DALI_SKIP_DOWNLOAD', False), + cmake_backend_enable("dali", "TRITON_DALI_SKIP_DOWNLOAD", False), ] def fil_cmake_args(images): - cargs = [cmake_backend_enable('fil', 'TRITON_FIL_DOCKER_BUILD', True)] - if 'base' in images: + cargs = [cmake_backend_enable("fil", "TRITON_FIL_DOCKER_BUILD", True)] + if "base" in images: cargs.append( - cmake_backend_arg('fil', 'TRITON_BUILD_CONTAINER', None, - images['base'])) + cmake_backend_arg("fil", "TRITON_BUILD_CONTAINER", None, images["base"]) + ) else: cargs.append( - cmake_backend_arg('fil', 'TRITON_BUILD_CONTAINER_VERSION', None, - TRITON_VERSION_MAP[FLAGS.version][1])) + cmake_backend_arg( + "fil", + "TRITON_BUILD_CONTAINER_VERSION", + None, + TRITON_VERSION_MAP[FLAGS.version][1], + ) + ) return cargs def armnn_tflite_cmake_args(): return [ - cmake_backend_arg('armnn_tflite', 'JOBS', None, - multiprocessing.cpu_count()), + cmake_backend_arg("armnn_tflite", "JOBS", None, multiprocessing.cpu_count()), + ] + + +def fastertransformer_cmake_args(): + print("Warning: FasterTransformer backend is not officially supported.") + cargs = [ + cmake_backend_arg( + "fastertransformer", "CMAKE_EXPORT_COMPILE_COMMANDS", None, 1 + ), + cmake_backend_arg("fastertransformer", "ENABLE_FP8", None, "OFF"), ] + return cargs + + +def tensorrtllm_cmake_args(images): + cargs = [ + cmake_backend_arg( + "tensorrtllm", + "TRT_LIB_DIR", + None, + "${TRT_ROOT}/targets/${ARCH}-linux-gnu/lib", + ), + cmake_backend_arg( + "tensorrtllm", "TRT_INCLUDE_DIR", None, "${TRT_ROOT}/include" + ), + cmake_backend_arg( + "tensorrtllm", + "TRTLLM_BUILD_CONTAINER", + None, + images["base"], + ), + ] + cargs.append(cmake_backend_enable("tensorrtllm", "TRITON_BUILD", True)) + return cargs def install_dcgm_libraries(dcgm_version, target_machine): - if dcgm_version == '': + if dcgm_version == "": fail( - 'unable to determine default repo-tag, DCGM version not 
known for {}' - .format(FLAGS.version)) - return '' + "unable to determine default repo-tag, DCGM version not known for {}".format( + FLAGS.version + ) + ) + return "" else: - if target_machine == 'aarch64': - return ''' + if target_machine == "aarch64": + return """ ENV DCGM_VERSION {} # Install DCGM. Steps from https://developer.nvidia.com/dcgm#Downloads RUN curl -o /tmp/cuda-keyring.deb \ - https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/sbsa/cuda-keyring_1.0-1_all.deb \ + https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/sbsa/cuda-keyring_1.0-1_all.deb \ && apt install /tmp/cuda-keyring.deb && rm /tmp/cuda-keyring.deb && \ apt-get update && apt-get install -y datacenter-gpu-manager=1:{} -'''.format(dcgm_version, dcgm_version) +""".format( + dcgm_version, dcgm_version + ) else: - return ''' + return """ ENV DCGM_VERSION {} # Install DCGM. Steps from https://developer.nvidia.com/dcgm#Downloads RUN curl -o /tmp/cuda-keyring.deb \ - https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb \ + https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb \ && apt install /tmp/cuda-keyring.deb && rm /tmp/cuda-keyring.deb && \ apt-get update && apt-get install -y datacenter-gpu-manager=1:{} -'''.format(dcgm_version, dcgm_version) +""".format( + dcgm_version, dcgm_version + ) def install_miniconda(conda_version, target_machine): - if conda_version == '': + if target_machine == "arm64": + # This branch handles the case where a Linux container is built on macOS with an ARM chip. + # macOS names the ARM architecture "arm64" while Linux calls it "aarch64", + # so map it to "aarch64" to find the right Miniconda build for Linux. + target_machine = "aarch64" + if conda_version == "": fail( - 'unable to determine default repo-tag, CONDA version not known for {}' - .format(FLAGS.version)) + "unable to determine default repo-tag, CONDA version not known for {}".format( + FLAGS.version + ) + ) miniconda_url = f"https://repo.anaconda.com/miniconda/Miniconda3-{conda_version}-Linux-{target_machine}.sh" - if target_machine == 'x86_64': - sha_sum = "3190da6626f86eee8abf1b2fd7a5af492994eb2667357ee4243975cdbb175d7a" + if target_machine == "x86_64": + sha_sum = "32d73e1bc33fda089d7cd9ef4c1be542616bd8e437d1f77afeeaf7afdb019787" else: - sha_sum = "0c20f121dc4c8010032d64f8e9b27d79e52d28355eb8d7972eafc90652387777" - return f''' + sha_sum = "80d6c306b015e1e3b01ea59dc66c676a81fa30279bc2da1f180a7ef7b2191d6e" + return f""" RUN mkdir -p /opt/ RUN wget "{miniconda_url}" -O miniconda.sh -q && \ echo "{sha_sum}" "miniconda.sh" > shasum && \ @@ -863,52 +894,68 @@ def install_miniconda(conda_version, target_machine): find /opt/conda/ -follow -type f -name '*.js.map' -delete && \ /opt/conda/bin/conda clean -afy ENV PATH /opt/conda/bin:${{PATH}} -''' +""" def create_dockerfile_buildbase(ddir, dockerfile_name, argmap): - df = ''' + df = """ ARG TRITON_VERSION={} ARG TRITON_CONTAINER_VERSION={} ARG BASE_IMAGE={} -'''.format(argmap['TRITON_VERSION'], argmap['TRITON_CONTAINER_VERSION'], - argmap['BASE_IMAGE']) +""".format( + argmap["TRITON_VERSION"], + argmap["TRITON_CONTAINER_VERSION"], + argmap["BASE_IMAGE"], + ) - df += ''' + df += """ FROM ${BASE_IMAGE} ARG TRITON_VERSION ARG TRITON_CONTAINER_VERSION -''' +""" # Install the windows- or linux-specific buildbase dependencies - if target_platform() == 'windows': - df += ''' + if target_platform() == "windows": + df += """ SHELL ["cmd", "/S", "/C"] -''' +""" else: -
df += ''' + df += """ # Ensure apt-get won't prompt for selecting options ENV DEBIAN_FRONTEND=noninteractive +# Install docker docker buildx +RUN apt-get update \ + && apt-get install -y ca-certificates curl gnupg \ + && install -m 0755 -d /etc/apt/keyrings \ + && curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg \ + && chmod a+r /etc/apt/keyrings/docker.gpg \ + && echo \ + "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \ + "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \ + tee /etc/apt/sources.list.d/docker.list > /dev/null \ + && apt-get update \ + && apt-get install -y docker.io docker-buildx-plugin + # libcurl4-openSSL-dev is needed for GCS # python3-dev is needed by Torchvision # python3-pip and libarchive-dev is needed by python backend -# uuid-dev and pkg-config is needed for Azure Storage +# libxml2-dev is needed for Azure Storage # scons is needed for armnn_tflite backend build dep -RUN apt-get update && \ - apt-get install -y --no-install-recommends \ +RUN apt-get update \ + && apt-get install -y --no-install-recommends \ ca-certificates \ autoconf \ automake \ build-essential \ - docker.io \ git \ + gperf \ libre2-dev \ libssl-dev \ libtool \ - libboost-dev \ libcurl4-openssl-dev \ libb64-dev \ + libgoogle-perftools-dev \ patchelf \ python3-dev \ python3-pip \ @@ -916,70 +963,81 @@ def create_dockerfile_buildbase(ddir, dockerfile_name, argmap): rapidjson-dev \ scons \ software-properties-common \ + pkg-config \ unzip \ wget \ zlib1g-dev \ libarchive-dev \ - pkg-config \ - uuid-dev \ - libnuma-dev && \ - rm -rf /var/lib/apt/lists/* + libxml2-dev \ + libnuma-dev \ + wget \ + && rm -rf /var/lib/apt/lists/* RUN pip3 install --upgrade pip && \ pip3 install --upgrade wheel setuptools docker +# Install boost version >= 1.78 for boost::span +# Current libboost-dev apt packages are < 1.78, so install from tar.gz +RUN wget -O /tmp/boost.tar.gz \ + https://boostorg.jfrog.io/artifactory/main/release/1.80.0/source/boost_1_80_0.tar.gz && \ + (cd /tmp && tar xzf boost.tar.gz) && \ + cd /tmp/boost_1_80_0 && ./bootstrap.sh --prefix=/usr && ./b2 install && \ + mv /tmp/boost_1_80_0/boost /usr/include/boost + # Server build requires recent version of CMake (FetchContent required) -RUN wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \ - gpg --dearmor - | \ - tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \ - apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' && \ - apt-get update && \ - apt-get install -y --no-install-recommends \ - cmake-data=3.21.1-0kitware1ubuntu20.04.1 cmake=3.21.1-0kitware1ubuntu20.04.1 -''' +RUN apt update -q=2 \\ + && apt install -y gpg wget \\ + && wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null \\ + && . 
/etc/os-release \\ + && echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null \\ + && apt-get update -q=2 \\ + && apt-get install -y --no-install-recommends cmake=3.27.7* cmake-data=3.27.7* +""" if FLAGS.enable_gpu: - df += install_dcgm_libraries(argmap['DCGM_VERSION'], - target_machine()) + df += install_dcgm_libraries(argmap["DCGM_VERSION"], target_machine()) - df += ''' + df += """ ENV TRITON_SERVER_VERSION ${TRITON_VERSION} ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION} -''' +""" # Copy in the triton source. We remove existing contents first in # case the FROM container has something there already. - if target_platform() == 'windows': - df += ''' + if target_platform() == "windows": + df += """ WORKDIR /workspace RUN rmdir /S/Q * || exit 0 COPY . . -''' +""" else: - df += ''' + df += """ WORKDIR /workspace RUN rm -fr * COPY . . ENTRYPOINT [] -''' +""" # Install miniconda required for the DALI backend. - if target_platform() != 'windows': - df += install_miniconda(argmap['CONDA_VERSION'], target_machine()) + if target_platform() != "windows": + df += install_miniconda(argmap["CONDA_VERSION"], target_machine()) with open(os.path.join(ddir, dockerfile_name), "w") as dfile: dfile.write(df) def create_dockerfile_cibase(ddir, dockerfile_name, argmap): - df = ''' + df = """ ARG TRITON_VERSION={} ARG TRITON_CONTAINER_VERSION={} ARG BASE_IMAGE={} -'''.format(argmap['TRITON_VERSION'], argmap['TRITON_CONTAINER_VERSION'], - argmap['BASE_IMAGE']) +""".format( + argmap["TRITON_VERSION"], + argmap["TRITON_CONTAINER_VERSION"], + argmap["BASE_IMAGE"], + ) - df += ''' + df += """ FROM ${BASE_IMAGE} ARG TRITON_VERSION @@ -991,80 +1049,84 @@ def create_dockerfile_cibase(ddir, dockerfile_name, argmap): ENV TRITON_SERVER_VERSION ${TRITON_VERSION} ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION} -''' +""" with open(os.path.join(ddir, dockerfile_name), "w") as dfile: dfile.write(df) -def create_dockerfile_linux(ddir, dockerfile_name, argmap, backends, repoagents, - endpoints): - df = ''' +def create_dockerfile_linux( + ddir, dockerfile_name, argmap, backends, repoagents, caches, endpoints +): + df = """ ARG TRITON_VERSION={} ARG TRITON_CONTAINER_VERSION={} ARG BASE_IMAGE={} -'''.format(argmap['TRITON_VERSION'], argmap['TRITON_CONTAINER_VERSION'], - argmap['BASE_IMAGE']) +""".format( + argmap["TRITON_VERSION"], + argmap["TRITON_CONTAINER_VERSION"], + argmap["BASE_IMAGE"], + ) - # PyTorch, TensorFlow 1 and TensorFlow 2 backends need extra CUDA and other + # PyTorch and TensorFlow backends need extra CUDA and other # dependencies during runtime that are missing in the CPU-only base container. # These dependencies must be copied from the Triton Min image. 
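# For CPU-only builds the generated Dockerfile adds a "min" stage and copies
# the CUDA/NCCL libraries that PyTorch and TensorFlow still need out of it.
# A minimal sketch of that multi-stage pattern, assuming a placeholder image
# tag and library list rather than the exact values this script emits:
def cpu_only_min_stage(gpu_min_image, libs_arch="x86_64", libs=("libnccl.so.2",)):
    df = f"FROM {gpu_min_image} AS min_container\n"
    df += "FROM ubuntu:22.04\n"
    for lib in libs:
        df += (
            f"COPY --from=min_container /usr/lib/{libs_arch}-linux-gnu/{lib} "
            f"/usr/lib/{libs_arch}-linux-gnu/{lib}\n"
        )
    return df
# e.g. print(cpu_only_min_stage("nvcr.io/nvidia/tritonserver:23.10-py3-min"))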
- if not FLAGS.enable_gpu and (('pytorch' in backends) or - ('tensorflow1' in backends) or - ('tensorflow2' in backends)): - df += ''' + if not FLAGS.enable_gpu and (("pytorch" in backends) or ("tensorflow" in backends)): + df += """ ############################################################################ ## Triton Min image ############################################################################ FROM {} AS min_container -'''.format(argmap['GPU_BASE_IMAGE']) +""".format( + argmap["GPU_BASE_IMAGE"] + ) - df += ''' + df += """ ############################################################################ ## Production stage: Create container with just inference server executable ############################################################################ FROM ${BASE_IMAGE} -''' +""" - df += dockerfile_prepare_container_linux(argmap, backends, FLAGS.enable_gpu, - target_machine()) + df += dockerfile_prepare_container_linux( + argmap, backends, FLAGS.enable_gpu, target_machine() + ) - df += ''' + df += """ WORKDIR /opt COPY --chown=1000:1000 build/install tritonserver WORKDIR /opt/tritonserver COPY --chown=1000:1000 NVIDIA_Deep_Learning_Container_License.pdf . -''' +""" if not FLAGS.no_core_build: # Add feature labels for SageMaker endpoint - if 'sagemaker' in endpoints: - df += ''' + if "sagemaker" in endpoints: + df += """ LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true LABEL com.amazonaws.sagemaker.capabilities.multi-models=true COPY --chown=1000:1000 docker/sagemaker/serve /usr/bin/. -''' +""" # This is required since libcublasLt.so is not present during the build # stage of the PyTorch backend - if not FLAGS.enable_gpu and ('pytorch' in backends): - df += ''' -RUN patchelf --add-needed /usr/local/cuda/lib64/stubs/libcublasLt.so.11 backends/pytorch/libtorch_cuda.so -''' + if not FLAGS.enable_gpu and ("pytorch" in backends): + df += """ +RUN patchelf --add-needed /usr/local/cuda/lib64/stubs/libcublasLt.so.12 backends/pytorch/libtorch_cuda.so +""" with open(os.path.join(ddir, dockerfile_name), "w") as dfile: dfile.write(df) -def dockerfile_prepare_container_linux(argmap, backends, enable_gpu, - target_machine): +def dockerfile_prepare_container_linux(argmap, backends, enable_gpu, target_machine): gpu_enabled = 1 if enable_gpu else 0 # Common steps to produce docker images shared by build.py and compose.py. - # Sets enviroment variables, installs dependencies and adds entrypoint - df = ''' + # Sets environment variables, installs dependencies and adds entrypoint + df = """ ARG TRITON_VERSION ARG TRITON_CONTAINER_VERSION @@ -1073,24 +1135,36 @@ def dockerfile_prepare_container_linux(argmap, backends, enable_gpu, LABEL com.nvidia.tritonserver.version="${TRITON_SERVER_VERSION}" ENV PATH /opt/tritonserver/bin:${PATH} -''' +# Remove once https://github.com/openucx/ucx/pull/9148 is available +# in the min container. 
+ENV UCX_MEM_EVENTS no +""" # TODO Remove once the ORT-OpenVINO "Exception while Reading network" is fixed - if 'onnxruntime' in backends: - df += ''' + if "onnxruntime" in backends: + df += """ ENV LD_LIBRARY_PATH /opt/tritonserver/backends/onnxruntime:${LD_LIBRARY_PATH} -''' +""" + + # Necessary for libtorch.so to find correct HPCX libraries + if "pytorch" in backends: + df += """ +ENV LD_LIBRARY_PATH /opt/hpcx/ucc/lib/:/opt/hpcx/ucx/lib/:${LD_LIBRARY_PATH} +""" backend_dependencies = "" # libgomp1 is needed by both onnxruntime and pytorch backends - if ('onnxruntime' in backends) or ('pytorch' in backends): + if ("onnxruntime" in backends) or ("pytorch" in backends): backend_dependencies = "libgomp1" # libgfortran5 is needed by pytorch backend on ARM - if ('pytorch' in backends) and (target_machine == 'aarch64'): + if ("pytorch" in backends) and (target_machine == "aarch64"): backend_dependencies += " libgfortran5" + # openssh-server is needed for fastertransformer + if "fastertransformer" in backends: + backend_dependencies += " openssh-server" - df += ''' + df += """ ENV TF_ADJUST_HUE_FUSED 1 ENV TF_ADJUST_SATURATION_FUSED 1 ENV TF_ENABLE_WINOGRAD_NONFUSED 1 @@ -1113,76 +1187,68 @@ def dockerfile_prepare_container_linux(argmap, backends, enable_gpu, # Common dependencies. FIXME (can any of these be conditional? For # example libcurl only needed for GCS?) -RUN apt-get update && \ - apt-get install -y --no-install-recommends \ - software-properties-common \ +RUN apt-get update \ + && apt-get install -y --no-install-recommends \ + clang \ + curl \ + dirmngr \ + git \ + gperf \ libb64-0d \ libcurl4-openssl-dev \ - libre2-5 \ - git \ - dirmngr \ + libgoogle-perftools-dev \ + libjemalloc-dev \ libnuma-dev \ - curl \ - {backend_dependencies} && \ - rm -rf /var/lib/apt/lists/* -'''.format(gpu_enabled=gpu_enabled, backend_dependencies=backend_dependencies) + libre2-9 \ + software-properties-common \ + wget \ + {backend_dependencies} \ + && rm -rf /var/lib/apt/lists/* + +# Install boost version >= 1.78 for boost::span +# Current libboost-dev apt packages are < 1.78, so install from tar.gz +RUN wget -O /tmp/boost.tar.gz \ + https://boostorg.jfrog.io/artifactory/main/release/1.80.0/source/boost_1_80_0.tar.gz \ + && (cd /tmp && tar xzf boost.tar.gz) \ + && cd /tmp/boost_1_80_0 \ + && ./bootstrap.sh --prefix=/usr \ + && ./b2 install \ + && rm -rf /tmp/boost* + +# Set TCMALLOC_RELEASE_RATE for users setting LD_PRELOAD with tcmalloc +ENV TCMALLOC_RELEASE_RATE 200 +""".format( + gpu_enabled=gpu_enabled, backend_dependencies=backend_dependencies + ) + + if "fastertransformer" in backends: + be = "fastertransformer" + url = "https://raw.githubusercontent.com/triton-inference-server/fastertransformer_backend/{}/docker/create_dockerfile_and_build.py".format( + backends[be] + ) + response = requests.get(url) + spec = importlib.util.spec_from_loader( + "fastertransformer_buildscript", loader=None, origin=url + ) + fastertransformer_buildscript = importlib.util.module_from_spec(spec) + exec(response.content, fastertransformer_buildscript.__dict__) + df += fastertransformer_buildscript.create_postbuild(is_multistage_build=False) if enable_gpu: - df += install_dcgm_libraries(argmap['DCGM_VERSION'], target_machine) - df += ''' + df += install_dcgm_libraries(argmap["DCGM_VERSION"], target_machine) + df += """ # Extra defensive wiring for CUDA Compat lib RUN ln -sf ${_CUDA_COMPAT_PATH}/lib.real ${_CUDA_COMPAT_PATH}/lib \ && echo ${_CUDA_COMPAT_PATH}/lib > /etc/ld.so.conf.d/00-cuda-compat.conf \ && ldconfig \ && 
rm -f ${_CUDA_COMPAT_PATH}/lib -''' - +""" else: - libs_arch = 'aarch64' if target_machine == 'aarch64' else 'x86_64' - if ('pytorch' in backends) or ('tensorflow1' in backends): - # Add extra dependencies for tensorflow1/pytorch backend. - # Note: Even though the build is CPU-only, the version of tensorflow1/ - # pytorch we are using depend upon libraries like cuda and cudnn. Since - # these dependencies are not present in the ubuntu base image, - # we must copy these from the Triton min container ourselves. - cuda_arch = 'sbsa' if target_machine == 'aarch64' else 'x86_64' - df += ''' -RUN mkdir -p /usr/local/cuda/lib64/stubs -COPY --from=min_container /usr/local/cuda/lib64/stubs/libcusparse.so /usr/local/cuda/lib64/stubs/libcusparse.so.11 -COPY --from=min_container /usr/local/cuda/lib64/stubs/libcusolver.so /usr/local/cuda/lib64/stubs/libcusolver.so.11 -COPY --from=min_container /usr/local/cuda/lib64/stubs/libcurand.so /usr/local/cuda/lib64/stubs/libcurand.so.10 -COPY --from=min_container /usr/local/cuda/lib64/stubs/libcufft.so /usr/local/cuda/lib64/stubs/libcufft.so.10 -COPY --from=min_container /usr/local/cuda/lib64/stubs/libcublas.so /usr/local/cuda/lib64/stubs/libcublas.so.11 -COPY --from=min_container /usr/local/cuda/lib64/stubs/libcublasLt.so /usr/local/cuda/lib64/stubs/libcublasLt.so.11 - -RUN mkdir -p /usr/local/cuda/targets/{cuda_arch}-linux/lib -COPY --from=min_container /usr/local/cuda-11.7/targets/{cuda_arch}-linux/lib/libcudart.so.11.0 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. -COPY --from=min_container /usr/local/cuda-11.7/targets/{cuda_arch}-linux/lib/libcupti.so.11.7 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. -COPY --from=min_container /usr/local/cuda-11.7/targets/{cuda_arch}-linux/lib/libnvToolsExt.so.1 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. - -COPY --from=min_container /usr/lib/{libs_arch}-linux-gnu/libcudnn.so.8 /usr/lib/{libs_arch}-linux-gnu/libcudnn.so.8 - -# patchelf is needed to add deps of libcublasLt.so.11 to libtorch_cuda.so -RUN apt-get update && \ - apt-get install -y --no-install-recommends openmpi-bin patchelf - -ENV LD_LIBRARY_PATH /usr/local/cuda/targets/{cuda_arch}-linux/lib:/usr/local/cuda/lib64/stubs:${{LD_LIBRARY_PATH}} -'''.format(cuda_arch=cuda_arch, libs_arch=libs_arch) - - if ('pytorch' in backends) or ('tensorflow1' in backends) \ - or ('tensorflow2' in backends): - # Add NCCL dependency for tensorflow1/tensorflow2/pytorch backend. - # Note: Even though the build is CPU-only, the version of tensorflow1/ - # tensorflow2/pytorch we are using depends upon the NCCL library. Since - # this dependency is not present in the ubuntu base image, we must - # copy it from the Triton min container ourselves. 
- df += ''' -COPY --from=min_container /usr/lib/{libs_arch}-linux-gnu/libnccl.so.2 /usr/lib/{libs_arch}-linux-gnu/libnccl.so.2 -'''.format(libs_arch=libs_arch) + df += add_cpu_libs_to_linux_dockerfile(backends, target_machine) # Add dependencies needed for python backend - if 'python' in backends: - df += ''' + if "python" in backends: + df += """ # python3, python3-pip and some pip installs required for the python backend RUN apt-get update && \ apt-get install -y --no-install-recommends \ @@ -1193,37 +1259,122 @@ def dockerfile_prepare_container_linux(argmap, backends, enable_gpu, pip3 install --upgrade wheel setuptools && \ pip3 install --upgrade numpy && \ rm -rf /var/lib/apt/lists/* -''' +""" + # Add dependencies needed for tensorrtllm backend + if "tensorrtllm" in backends: + be = "tensorrtllm" + url = "https://raw.githubusercontent.com/triton-inference-server/tensorrtllm_backend/{}/tools/gen_trtllm_dockerfile.py".format( + backends[be] + ) + + response = requests.get(url) + spec = importlib.util.spec_from_loader( + "trtllm_buildscript", loader=None, origin=url + ) + trtllm_buildscript = importlib.util.module_from_spec(spec) + exec(response.content, trtllm_buildscript.__dict__) + df += trtllm_buildscript.create_postbuild(backends[be]) + + if "vllm" in backends: + # [DLIS-5606] Build Conda environment for vLLM backend + # Remove Pip install once vLLM backend moves to Conda environment. + df += """ +# vLLM needed for vLLM backend +RUN pip3 install vllm=={} +""".format( + TRITON_VERSION_MAP[FLAGS.version][7] + ) - df += ''' + df += """ WORKDIR /opt/tritonserver RUN rm -fr /opt/tritonserver/* ENV NVIDIA_PRODUCT_NAME="Triton Server" COPY docker/entrypoint.d/ /opt/nvidia/entrypoint.d/ -''' +""" # The CPU-only build uses ubuntu as the base image, and so the # entrypoint files are not available in /opt/nvidia in the base # image, so we must provide them ourselves. if not enable_gpu: - df += ''' + df += """ COPY docker/cpu_only/ /opt/nvidia/ ENTRYPOINT ["/opt/nvidia/nvidia_entrypoint.sh"] -''' +""" - df += ''' + df += """ ENV NVIDIA_BUILD_ID {} LABEL com.nvidia.build.id={} LABEL com.nvidia.build.ref={} -'''.format(argmap['NVIDIA_BUILD_ID'], argmap['NVIDIA_BUILD_ID'], - argmap['NVIDIA_BUILD_REF']) +""".format( + argmap["NVIDIA_BUILD_ID"], argmap["NVIDIA_BUILD_ID"], argmap["NVIDIA_BUILD_REF"] + ) + + return df + + +def add_cpu_libs_to_linux_dockerfile(backends, target_machine): + df = "" + libs_arch = "aarch64" if target_machine == "aarch64" else "x86_64" + if "pytorch" in backends: + # Add extra dependencies for pytorch backend. + # Note: Even though the build is CPU-only, the version of pytorch + # we are using depend upon libraries like cuda and cudnn. Since + # these dependencies are not present in the ubuntu base image, + # we must copy these from the Triton min container ourselves. 
+ cuda_arch = "sbsa" if target_machine == "aarch64" else "x86_64" + df += """ +RUN mkdir -p /usr/local/cuda/lib64/stubs +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcusparse.so /usr/local/cuda/lib64/stubs/libcusparse.so.12 +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcusolver.so /usr/local/cuda/lib64/stubs/libcusolver.so.11 +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcurand.so /usr/local/cuda/lib64/stubs/libcurand.so.10 +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcufft.so /usr/local/cuda/lib64/stubs/libcufft.so.11 +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcublas.so /usr/local/cuda/lib64/stubs/libcublas.so.12 +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcublasLt.so /usr/local/cuda/lib64/stubs/libcublasLt.so.12 +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcublasLt.so /usr/local/cuda/lib64/stubs/libcublasLt.so.11 + +RUN mkdir -p /usr/local/cuda/targets/{cuda_arch}-linux/lib +COPY --from=min_container /usr/local/cuda/lib64/libcudart.so.12 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. +COPY --from=min_container /usr/local/cuda/lib64/libcupti.so.12 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. +COPY --from=min_container /usr/local/cuda/lib64/libnvToolsExt.so.1 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. +COPY --from=min_container /usr/local/cuda/lib64/libnvJitLink.so.12 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. + +RUN mkdir -p /opt/hpcx/ucc/lib/ /opt/hpcx/ucx/lib/ +COPY --from=min_container /opt/hpcx/ucc/lib/libucc.so.1 /opt/hpcx/ucc/lib/libucc.so.1 +COPY --from=min_container /opt/hpcx/ucx/lib/libucm.so.0 /opt/hpcx/ucx/lib/libucm.so.0 +COPY --from=min_container /opt/hpcx/ucx/lib/libucp.so.0 /opt/hpcx/ucx/lib/libucp.so.0 +COPY --from=min_container /opt/hpcx/ucx/lib/libucs.so.0 /opt/hpcx/ucx/lib/libucs.so.0 +COPY --from=min_container /opt/hpcx/ucx/lib/libuct.so.0 /opt/hpcx/ucx/lib/libuct.so.0 + +COPY --from=min_container /usr/lib/{libs_arch}-linux-gnu/libcudnn.so.8 /usr/lib/{libs_arch}-linux-gnu/libcudnn.so.8 + +# patchelf is needed to add deps of libcublasLt.so.12 to libtorch_cuda.so +RUN apt-get update && \ + apt-get install -y --no-install-recommends openmpi-bin patchelf + +ENV LD_LIBRARY_PATH /usr/local/cuda/targets/{cuda_arch}-linux/lib:/usr/local/cuda/lib64/stubs:${{LD_LIBRARY_PATH}} +""".format( + cuda_arch=cuda_arch, libs_arch=libs_arch + ) + + if ("pytorch" in backends) or ("tensorflow" in backends): + # Add NCCL dependency for tensorflow/pytorch backend. + # Note: Even though the build is CPU-only, the version of + # tensorflow/pytorch we are using depends upon the NCCL library. + # Since this dependency is not present in the ubuntu base image, + # we must copy it from the Triton min container ourselves. 
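# The templates above mix Python str.format() placeholders ({cuda_arch},
# {libs_arch}) with Dockerfile variable references, so literal ${...} is
# written as ${{...}} to survive formatting. A small self-contained
# illustration of that escaping:
template = (
    "ENV LD_LIBRARY_PATH /usr/local/cuda/targets/{cuda_arch}-linux/lib:"
    "/usr/local/cuda/lib64/stubs:${{LD_LIBRARY_PATH}}"
)
print(template.format(cuda_arch="x86_64"))
# -> ENV LD_LIBRARY_PATH /usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/lib64/stubs:${LD_LIBRARY_PATH}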
+ df += """ +COPY --from=min_container /usr/lib/{libs_arch}-linux-gnu/libnccl.so.2 /usr/lib/{libs_arch}-linux-gnu/libnccl.so.2 +""".format( + libs_arch=libs_arch + ) return df -def create_dockerfile_windows(ddir, dockerfile_name, argmap, backends, - repoagents): - df = ''' +def create_dockerfile_windows( + ddir, dockerfile_name, argmap, backends, repoagents, caches +): + df = """ ARG TRITON_VERSION={} ARG TRITON_CONTAINER_VERSION={} ARG BASE_IMAGE={} @@ -1242,9 +1393,12 @@ def create_dockerfile_windows(ddir, dockerfile_name, argmap, backends, RUN setx path "%path%;C:\opt\tritonserver\bin" -'''.format(argmap['TRITON_VERSION'], argmap['TRITON_CONTAINER_VERSION'], - argmap['BASE_IMAGE']) - df += ''' +""".format( + argmap["TRITON_VERSION"], + argmap["TRITON_CONTAINER_VERSION"], + argmap["BASE_IMAGE"], + ) + df += """ WORKDIR /opt RUN rmdir /S/Q tritonserver || exit 0 COPY --chown=1000:1000 build/install tritonserver @@ -1252,118 +1406,136 @@ def create_dockerfile_windows(ddir, dockerfile_name, argmap, backends, WORKDIR /opt/tritonserver COPY --chown=1000:1000 NVIDIA_Deep_Learning_Container_License.pdf . -''' - df += ''' +""" + df += """ ENTRYPOINT [] ENV NVIDIA_BUILD_ID {} LABEL com.nvidia.build.id={} LABEL com.nvidia.build.ref={} -'''.format(argmap['NVIDIA_BUILD_ID'], argmap['NVIDIA_BUILD_ID'], - argmap['NVIDIA_BUILD_REF']) +""".format( + argmap["NVIDIA_BUILD_ID"], argmap["NVIDIA_BUILD_ID"], argmap["NVIDIA_BUILD_REF"] + ) with open(os.path.join(ddir, dockerfile_name), "w") as dfile: dfile.write(df) -def create_build_dockerfiles(container_build_dir, images, backends, repoagents, - endpoints): - if 'base' in images: - base_image = images['base'] - elif target_platform() == 'windows': - base_image = 'mcr.microsoft.com/dotnet/framework/sdk:4.8' +def create_build_dockerfiles( + container_build_dir, images, backends, repoagents, caches, endpoints +): + if "base" in images: + base_image = images["base"] + elif target_platform() == "windows": + base_image = "mcr.microsoft.com/dotnet/framework/sdk:4.8" elif FLAGS.enable_gpu: - base_image = 'nvcr.io/nvidia/tritonserver:{}-py3-min'.format( - FLAGS.upstream_container_version) + base_image = "nvcr.io/nvidia/tritonserver:{}-py3-min".format( + FLAGS.upstream_container_version + ) else: - base_image = 'ubuntu:20.04' + base_image = "ubuntu:22.04" dockerfileargmap = { - 'NVIDIA_BUILD_REF': - '' if FLAGS.build_sha is None else FLAGS.build_sha, - 'NVIDIA_BUILD_ID': - '' if FLAGS.build_id is None else FLAGS.build_id, - 'TRITON_VERSION': - FLAGS.version, - 'TRITON_CONTAINER_VERSION': - FLAGS.container_version, - 'BASE_IMAGE': - base_image, - 'DCGM_VERSION': - '' if FLAGS.version is None or FLAGS.version - not in TRITON_VERSION_MAP else TRITON_VERSION_MAP[FLAGS.version][5], - 'CONDA_VERSION': - '' if FLAGS.version is None or FLAGS.version - not in TRITON_VERSION_MAP else TRITON_VERSION_MAP[FLAGS.version][6] + "NVIDIA_BUILD_REF": "" if FLAGS.build_sha is None else FLAGS.build_sha, + "NVIDIA_BUILD_ID": "" if FLAGS.build_id is None else FLAGS.build_id, + "TRITON_VERSION": FLAGS.version, + "TRITON_CONTAINER_VERSION": FLAGS.container_version, + "BASE_IMAGE": base_image, + "DCGM_VERSION": "" + if FLAGS.version is None or FLAGS.version not in TRITON_VERSION_MAP + else TRITON_VERSION_MAP[FLAGS.version][5], + "CONDA_VERSION": "" + if FLAGS.version is None or FLAGS.version not in TRITON_VERSION_MAP + else TRITON_VERSION_MAP[FLAGS.version][6], } # For CPU-only image we need to copy some cuda libraries and dependencies - # since we are using PyTorch, TensorFlow 1, TensorFlow 2 
containers that + # since we are using PyTorch and TensorFlow containers that # are not CPU-only. - if not FLAGS.enable_gpu and ( - ('pytorch' in backends) or ('tensorflow1' in backends) or - ('tensorflow2' in backends)) and (target_platform() != 'windows'): - if 'gpu-base' in images: - gpu_base_image = images['gpu-base'] + if ( + not FLAGS.enable_gpu + and (("pytorch" in backends) or ("tensorflow" in backends)) + and (target_platform() != "windows") + ): + if "gpu-base" in images: + gpu_base_image = images["gpu-base"] else: - gpu_base_image = 'nvcr.io/nvidia/tritonserver:{}-py3-min'.format( - FLAGS.upstream_container_version) - dockerfileargmap['GPU_BASE_IMAGE'] = gpu_base_image + gpu_base_image = "nvcr.io/nvidia/tritonserver:{}-py3-min".format( + FLAGS.upstream_container_version + ) + dockerfileargmap["GPU_BASE_IMAGE"] = gpu_base_image - create_dockerfile_buildbase(FLAGS.build_dir, 'Dockerfile.buildbase', - dockerfileargmap) + create_dockerfile_buildbase( + FLAGS.build_dir, "Dockerfile.buildbase", dockerfileargmap + ) - if target_platform() == 'windows': - create_dockerfile_windows(FLAGS.build_dir, 'Dockerfile', - dockerfileargmap, backends, repoagents) + if target_platform() == "windows": + create_dockerfile_windows( + FLAGS.build_dir, + "Dockerfile", + dockerfileargmap, + backends, + repoagents, + caches, + ) else: - create_dockerfile_linux(FLAGS.build_dir, 'Dockerfile', dockerfileargmap, - backends, repoagents, endpoints) + create_dockerfile_linux( + FLAGS.build_dir, + "Dockerfile", + dockerfileargmap, + backends, + repoagents, + caches, + endpoints, + ) # Dockerfile used for the creating the CI base image. - create_dockerfile_cibase(FLAGS.build_dir, 'Dockerfile.cibase', - dockerfileargmap) + create_dockerfile_cibase(FLAGS.build_dir, "Dockerfile.cibase", dockerfileargmap) -def create_docker_build_script(script_name, container_install_dir, - container_ci_dir): +def create_docker_build_script(script_name, container_install_dir, container_ci_dir): with BuildScript( - os.path.join(FLAGS.build_dir, script_name), - verbose=FLAGS.verbose, - desc=('Docker-based build script for Triton Inference Server' - )) as docker_script: - + os.path.join(FLAGS.build_dir, script_name), + verbose=FLAGS.verbose, + desc=("Docker-based build script for Triton Inference Server"), + ) as docker_script: # # Build base image... tritonserver_buildbase # docker_script.commentln(8) - docker_script.comment('Create Triton base build image') + docker_script.comment("Create Triton base build image") docker_script.comment( - 'This image contains all dependencies necessary to build Triton') + "This image contains all dependencies necessary to build Triton" + ) docker_script.comment() cachefrommap = [ - 'tritonserver_buildbase', 'tritonserver_buildbase_cache0', - 'tritonserver_buildbase_cache1' + "tritonserver_buildbase", + "tritonserver_buildbase_cache0", + "tritonserver_buildbase_cache1", ] baseargs = [ - 'docker', 'build', '-t', 'tritonserver_buildbase', '-f', - os.path.join(FLAGS.build_dir, 'Dockerfile.buildbase') + "docker", + "build", + "-t", + "tritonserver_buildbase", + "-f", + os.path.join(FLAGS.build_dir, "Dockerfile.buildbase"), ] if not FLAGS.no_container_pull: baseargs += [ - '--pull', + "--pull", ] # Windows docker runs in a VM and memory needs to be specified # explicitly (at least for some configurations of docker). 
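# Roughly the docker invocation the generated build script issues for the
# build base image; the Dockerfile path depends on --build-dir and is shown
# here as "build/" purely for illustration:
cache_from = [
    "tritonserver_buildbase",
    "tritonserver_buildbase_cache0",
    "tritonserver_buildbase_cache1",
]
cmd = ["docker", "build", "-t", "tritonserver_buildbase",
       "-f", "build/Dockerfile.buildbase", "--pull"]
cmd += [f"--cache-from={c}" for c in cache_from] + ["."]
print(" ".join(cmd))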
- if target_platform() == 'windows': + if target_platform() == "windows": if FLAGS.container_memory: - baseargs += ['--memory', FLAGS.container_memory] + baseargs += ["--memory", FLAGS.container_memory] - baseargs += ['--cache-from={}'.format(k) for k in cachefrommap] - baseargs += ['.'] + baseargs += ["--cache-from={}".format(k) for k in cachefrommap] + baseargs += ["."] docker_script.cwd(THIS_SCRIPT_DIR) docker_script.cmd(baseargs, check_exitcode=True) @@ -1373,10 +1545,9 @@ def create_docker_build_script(script_name, container_install_dir, # docker_script.blankln() docker_script.commentln(8) - docker_script.comment('Run build in tritonserver_buildbase container') - docker_script.comment( - 'Mount a directory into the container where the install') - docker_script.comment('artifacts will be placed.') + docker_script.comment("Run build in tritonserver_buildbase container") + docker_script.comment("Mount a directory into the container where the install") + docker_script.comment("artifacts will be placed.") docker_script.comment() # Don't use '-v' to communicate the built artifacts out of the @@ -1384,63 +1555,76 @@ def create_docker_build_script(script_name, container_install_dir, # Docker (i.e. docker-in-docker) and not just if run directly # from host. runargs = [ - 'docker', 'run', '-w', '/workspace/build', '--name', - 'tritonserver_builder' + "docker", + "run", + "-w", + "/workspace/build", + "--name", + "tritonserver_builder", ] if not FLAGS.no_container_interactive: - runargs += ['-it'] + runargs += ["-it"] - if target_platform() == 'windows': + if target_platform() == "windows": if FLAGS.container_memory: - runargs += ['--memory', FLAGS.container_memory] - runargs += [ - '-v', '\\\\.\pipe\docker_engine:\\\\.\pipe\docker_engine' - ] + runargs += ["--memory", FLAGS.container_memory] + runargs += ["-v", "\\\\.\pipe\docker_engine:\\\\.\pipe\docker_engine"] else: - runargs += ['-v', '/var/run/docker.sock:/var/run/docker.sock'] + runargs += ["-v", "/var/run/docker.sock:/var/run/docker.sock"] - runargs += ['tritonserver_buildbase'] + runargs += ["tritonserver_buildbase"] - if target_platform() == 'windows': - runargs += [ - 'powershell.exe', '-noexit', '-File', './cmake_build.ps1' - ] + if target_platform() == "windows": + runargs += ["powershell.exe", "-noexit", "-File", "./cmake_build.ps1"] else: - runargs += ['./cmake_build'] + runargs += ["./cmake_build"] # Remove existing tritonserver_builder container... - if target_platform() == 'windows': - docker_script.cmd(['docker', 'rm', 'tritonserver_builder']) + if target_platform() == "windows": + docker_script.cmd(["docker", "rm", "tritonserver_builder"]) else: docker_script._file.write( - 'if [ "$(docker ps -a | grep tritonserver_builder)" ]; then docker rm tritonserver_builder; fi\n' + 'if [ "$(docker ps -a | grep tritonserver_builder)" ]; then docker rm -f tritonserver_builder; fi\n' ) docker_script.cmd(runargs, check_exitcode=True) - docker_script.cmd([ - 'docker', 'cp', 'tritonserver_builder:/tmp/tritonbuild/install', - FLAGS.build_dir - ], - check_exitcode=True) - docker_script.cmd([ - 'docker', 'cp', 'tritonserver_builder:/tmp/tritonbuild/ci', - FLAGS.build_dir - ], - check_exitcode=True) + docker_script.cmd( + [ + "docker", + "cp", + "tritonserver_builder:/tmp/tritonbuild/install", + FLAGS.build_dir, + ], + check_exitcode=True, + ) + docker_script.cmd( + [ + "docker", + "cp", + "tritonserver_builder:/tmp/tritonbuild/ci", + FLAGS.build_dir, + ], + check_exitcode=True, + ) # # Final image... 
tritonserver # docker_script.blankln() docker_script.commentln(8) - docker_script.comment('Create final tritonserver image') + docker_script.comment("Create final tritonserver image") docker_script.comment() finalargs = [ - 'docker', 'build', '-t', 'tritonserver', '-f', - os.path.join(FLAGS.build_dir, 'Dockerfile'), '.' + "docker", + "build", + "-t", + "tritonserver", + "-f", + os.path.join(FLAGS.build_dir, "Dockerfile"), + ".", ] docker_script.cwd(THIS_SCRIPT_DIR) @@ -1451,266 +1635,413 @@ def create_docker_build_script(script_name, container_install_dir, # docker_script.blankln() docker_script.commentln(8) - docker_script.comment('Create CI base image') + docker_script.comment("Create CI base image") docker_script.comment() cibaseargs = [ - 'docker', 'build', '-t', 'tritonserver_cibase', '-f', - os.path.join(FLAGS.build_dir, 'Dockerfile.cibase'), '.' + "docker", + "build", + "-t", + "tritonserver_cibase", + "-f", + os.path.join(FLAGS.build_dir, "Dockerfile.cibase"), + ".", ] docker_script.cwd(THIS_SCRIPT_DIR) docker_script.cmd(cibaseargs, check_exitcode=True) -def core_build(cmake_script, repo_dir, cmake_dir, build_dir, install_dir, - components, backends): - repo_build_dir = os.path.join(build_dir, 'tritonserver', 'build') - repo_install_dir = os.path.join(build_dir, 'tritonserver', 'install') +def core_build( + cmake_script, repo_dir, cmake_dir, build_dir, install_dir, components, backends +): + repo_build_dir = os.path.join(build_dir, "tritonserver", "build") + repo_install_dir = os.path.join(build_dir, "tritonserver", "install") cmake_script.commentln(8) - cmake_script.comment('Triton core library and tritonserver executable') + cmake_script.comment("Triton core library and tritonserver executable") cmake_script.comment() cmake_script.mkdir(repo_build_dir) cmake_script.cwd(repo_build_dir) cmake_script.cmake( - core_cmake_args(components, backends, cmake_dir, repo_install_dir)) + core_cmake_args(components, backends, cmake_dir, repo_install_dir) + ) cmake_script.makeinstall() - if target_platform() == 'windows': - cmake_script.mkdir(os.path.join(install_dir, 'bin')) + if target_platform() == "windows": + cmake_script.mkdir(os.path.join(install_dir, "bin")) cmake_script.cp( - os.path.join(repo_install_dir, 'bin', 'tritonserver.exe'), - os.path.join(install_dir, 'bin')) + os.path.join(repo_install_dir, "bin", "tritonserver.exe"), + os.path.join(install_dir, "bin"), + ) cmake_script.cp( - os.path.join(repo_install_dir, 'bin', 'tritonserver.dll'), - os.path.join(install_dir, 'bin')) + os.path.join(repo_install_dir, "bin", "tritonserver.dll"), + os.path.join(install_dir, "bin"), + ) else: - cmake_script.mkdir(os.path.join(install_dir, 'bin')) - cmake_script.cp(os.path.join(repo_install_dir, 'bin', 'tritonserver'), - os.path.join(install_dir, 'bin')) - cmake_script.mkdir(os.path.join(install_dir, 'lib')) + cmake_script.mkdir(os.path.join(install_dir, "bin")) + cmake_script.cp( + os.path.join(repo_install_dir, "bin", "tritonserver"), + os.path.join(install_dir, "bin"), + ) + cmake_script.mkdir(os.path.join(install_dir, "lib")) cmake_script.cp( - os.path.join(repo_install_dir, 'lib', 'libtritonserver.so'), - os.path.join(install_dir, 'lib')) + os.path.join(repo_install_dir, "lib", "libtritonserver.so"), + os.path.join(install_dir, "lib"), + ) + # [FIXME] Placing the Triton server wheel file in 'python' for now, should + # have been upload to pip registry and be able to install directly + cmake_script.mkdir(os.path.join(install_dir, "python")) + cmake_script.cp( + 
os.path.join(repo_install_dir, "python", "tritonserver*.whl"), + os.path.join(install_dir, "python"), + ) - cmake_script.mkdir(os.path.join(install_dir, 'include', 'triton')) + cmake_script.mkdir(os.path.join(install_dir, "include", "triton")) cmake_script.cpdir( - os.path.join(repo_install_dir, 'include', 'triton', 'core'), - os.path.join(install_dir, 'include', 'triton', 'core')) + os.path.join(repo_install_dir, "include", "triton", "core"), + os.path.join(install_dir, "include", "triton", "core"), + ) - cmake_script.cp(os.path.join(repo_dir, 'LICENSE'), install_dir) - cmake_script.cp(os.path.join(repo_dir, 'TRITON_VERSION'), install_dir) + cmake_script.cp(os.path.join(repo_dir, "LICENSE"), install_dir) + cmake_script.cp(os.path.join(repo_dir, "TRITON_VERSION"), install_dir) # If requested, package the source code for all OSS used to build # For windows, Triton is not delivered as a container so skip for # windows platform. - if target_platform() != 'windows': - if (not FLAGS.no_container_build) and (not FLAGS.no_core_build) and ( - not FLAGS.no_container_source): - cmake_script.mkdir(os.path.join(install_dir, 'third-party-src')) + if target_platform() != "windows": + if ( + (not FLAGS.no_container_build) + and (not FLAGS.no_core_build) + and (not FLAGS.no_container_source) + ): + cmake_script.mkdir(os.path.join(install_dir, "third-party-src")) cmake_script.cwd(repo_build_dir) cmake_script.tar( - 'third-party-src', - os.path.join(install_dir, 'third-party-src', 'src.tar.gz')) + "third-party-src", + os.path.join(install_dir, "third-party-src", "src.tar.gz"), + ) cmake_script.cp( - os.path.join(repo_dir, 'docker', 'README.third-party-src'), - os.path.join(install_dir, 'third-party-src', 'README')) + os.path.join(repo_dir, "docker", "README.third-party-src"), + os.path.join(install_dir, "third-party-src", "README"), + ) cmake_script.comment() - cmake_script.comment('end Triton core library and tritonserver executable') + cmake_script.comment("end Triton core library and tritonserver executable") cmake_script.commentln(8) cmake_script.blankln() -def backend_build(be, - cmake_script, - tag, - build_dir, - install_dir, - github_organization, - images, - components, - library_paths, - variant_index=0): - repo_build_dir = os.path.join(build_dir, be, 'build') - repo_install_dir = os.path.join(build_dir, be, 'install') +def tensorrtllm_prebuild(cmake_script): + # Export the TRT_ROOT environment variable + cmake_script.cmd("export TRT_ROOT=/usr/local/tensorrt") + cmake_script.cmd("export ARCH=$(uname -m)") + + # FIXME: Update the file structure to the one Triton expects. This is a temporary fix + # to get the build working for r23.10. 
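# The fastertransformer and tensorrtllm integrations above download a helper
# script from the corresponding backend repository while build.py runs and
# execute it in-process. A stripped-down sketch of that pattern, with a
# placeholder URL:
import importlib.util
import requests

url = "https://example.com/helper_script.py"  # placeholder, not a real endpoint
response = requests.get(url)
spec = importlib.util.spec_from_loader("helper_script", loader=None, origin=url)
helper = importlib.util.module_from_spec(spec)
exec(response.content, helper.__dict__)  # functions defined in the script land on `helper`
# e.g. helper.create_postbuild(...) could then be appended to the Dockerfile text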
+ cmake_script.cmd("mv tensorrtllm/inflight_batcher_llm/src tensorrtllm") + cmake_script.cmd("mv tensorrtllm/inflight_batcher_llm/cmake tensorrtllm") + cmake_script.cmd("mv tensorrtllm/inflight_batcher_llm/CMakeLists.txt tensorrtllm") + + +def backend_build( + be, + cmake_script, + tag, + build_dir, + install_dir, + github_organization, + images, + components, + library_paths, +): + repo_build_dir = os.path.join(build_dir, be, "build") + repo_install_dir = os.path.join(build_dir, be, "install") cmake_script.commentln(8) - cmake_script.comment(f'\'{be}\' backend') - cmake_script.comment('Delete this section to remove backend from build') + cmake_script.comment(f"'{be}' backend") + cmake_script.comment("Delete this section to remove backend from build") cmake_script.comment() cmake_script.mkdir(build_dir) cmake_script.cwd(build_dir) cmake_script.gitclone(backend_repo(be), tag, be, github_organization) + if be == "tensorrtllm": + tensorrtllm_prebuild(cmake_script) + cmake_script.mkdir(repo_build_dir) cmake_script.cwd(repo_build_dir) cmake_script.cmake( - backend_cmake_args(images, components, be, repo_install_dir, - library_paths, variant_index)) + backend_cmake_args(images, components, be, repo_install_dir, library_paths) + ) cmake_script.makeinstall() - cmake_script.mkdir(os.path.join(install_dir, 'backends')) - cmake_script.rmdir(os.path.join(install_dir, 'backends', be)) - cmake_script.cpdir(os.path.join(repo_install_dir, 'backends', be), - os.path.join(install_dir, 'backends')) + cmake_script.mkdir(os.path.join(install_dir, "backends")) + cmake_script.rmdir(os.path.join(install_dir, "backends", be)) + + cmake_script.cpdir( + os.path.join(repo_install_dir, "backends", be), + os.path.join(install_dir, "backends"), + ) cmake_script.comment() - cmake_script.comment(f'end \'{be}\' backend') + cmake_script.comment(f"end '{be}' backend") cmake_script.commentln(8) cmake_script.blankln() -def repo_agent_build(ra, cmake_script, build_dir, install_dir, repoagent_repo, - repoagents): - repo_build_dir = os.path.join(build_dir, ra, 'build') - repo_install_dir = os.path.join(build_dir, ra, 'install') +def backend_clone( + be, + clone_script, + tag, + build_dir, + install_dir, + github_organization, +): + clone_script.commentln(8) + clone_script.comment(f"'{be}' backend") + clone_script.comment("Delete this section to remove backend from build") + clone_script.comment() + clone_script.mkdir(build_dir) + clone_script.cwd(build_dir) + clone_script.gitclone(backend_repo(be), tag, be, github_organization) + + repo_target_dir = os.path.join(install_dir, "backends") + clone_script.mkdir(repo_target_dir) + backend_dir = os.path.join(repo_target_dir, be) + clone_script.rmdir(backend_dir) + clone_script.mkdir(backend_dir) + + clone_script.cp( + os.path.join(build_dir, be, "src", "model.py"), + backend_dir, + ) + + clone_script.comment() + clone_script.comment(f"end '{be}' backend") + clone_script.commentln(8) + clone_script.blankln() + + +def repo_agent_build( + ra, cmake_script, build_dir, install_dir, repoagent_repo, repoagents +): + repo_build_dir = os.path.join(build_dir, ra, "build") + repo_install_dir = os.path.join(build_dir, ra, "install") cmake_script.commentln(8) - cmake_script.comment(f'\'{ra}\' repository agent') - cmake_script.comment( - 'Delete this section to remove repository agent from build') + cmake_script.comment(f"'{ra}' repository agent") + cmake_script.comment("Delete this section to remove repository agent from build") cmake_script.comment() cmake_script.mkdir(build_dir) 
cmake_script.cwd(build_dir) - cmake_script.gitclone(repoagent_repo(ra), repoagents[ra], ra, - FLAGS.github_organization) + cmake_script.gitclone( + repoagent_repo(ra), repoagents[ra], ra, FLAGS.github_organization + ) cmake_script.mkdir(repo_build_dir) cmake_script.cwd(repo_build_dir) - cmake_script.cmake( - repoagent_cmake_args(images, components, ra, repo_install_dir)) + cmake_script.cmake(repoagent_cmake_args(images, components, ra, repo_install_dir)) cmake_script.makeinstall() - cmake_script.mkdir(os.path.join(install_dir, 'repoagents')) - cmake_script.rmdir(os.path.join(install_dir, 'repoagents', ra)) - cmake_script.cpdir(os.path.join(repo_install_dir, 'repoagents', ra), - os.path.join(install_dir, 'repoagents')) + cmake_script.mkdir(os.path.join(install_dir, "repoagents")) + cmake_script.rmdir(os.path.join(install_dir, "repoagents", ra)) + cmake_script.cpdir( + os.path.join(repo_install_dir, "repoagents", ra), + os.path.join(install_dir, "repoagents"), + ) cmake_script.comment() - cmake_script.comment(f'end \'{ra}\' repository agent') + cmake_script.comment(f"end '{ra}' repository agent") cmake_script.commentln(8) cmake_script.blankln() -def cibase_build(cmake_script, repo_dir, cmake_dir, build_dir, install_dir, - ci_dir, backends): - repo_build_dir = os.path.join(build_dir, 'tritonserver', 'build') - repo_install_dir = os.path.join(build_dir, 'tritonserver', 'install') +def cache_build(cache, cmake_script, build_dir, install_dir, cache_repo, caches): + repo_build_dir = os.path.join(build_dir, cache, "build") + repo_install_dir = os.path.join(build_dir, cache, "install") cmake_script.commentln(8) - cmake_script.comment('Collect Triton CI artifacts') + cmake_script.comment(f"'{cache}' cache") + cmake_script.comment("Delete this section to remove cache from build") + cmake_script.comment() + cmake_script.mkdir(build_dir) + cmake_script.cwd(build_dir) + cmake_script.gitclone( + cache_repo(cache), caches[cache], cache, FLAGS.github_organization + ) + + cmake_script.mkdir(repo_build_dir) + cmake_script.cwd(repo_build_dir) + cmake_script.cmake(cache_cmake_args(images, components, cache, repo_install_dir)) + cmake_script.makeinstall() + + cmake_script.mkdir(os.path.join(install_dir, "caches")) + cmake_script.rmdir(os.path.join(install_dir, "caches", cache)) + cmake_script.cpdir( + os.path.join(repo_install_dir, "caches", cache), + os.path.join(install_dir, "caches"), + ) + cmake_script.comment() + cmake_script.comment(f"end '{cache}' cache") + cmake_script.commentln(8) + cmake_script.blankln() + + +def cibase_build( + cmake_script, repo_dir, cmake_dir, build_dir, install_dir, ci_dir, backends +): + repo_install_dir = os.path.join(build_dir, "tritonserver", "install") + + cmake_script.commentln(8) + cmake_script.comment("Collect Triton CI artifacts") cmake_script.comment() cmake_script.mkdir(ci_dir) # On windows we are not yet using a CI/QA docker image for # testing, so don't do anything... - if target_platform() == 'windows': + if target_platform() == "windows": return # The core build produces some artifacts that are needed for CI # testing, so include those in the install. 
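# The CI collection below wraps each optional copy in a platform-specific
# existence guard so missing backends do not fail the script. A reduced
# sketch of the emitted lines for one backend (paths and the copy command
# itself are simplified here):
def guarded_copy_lines(src, dest, windows=False):
    if windows:
        return [f"if (Test-Path -Path {src}) {{", f"  cp -r {src} {dest}", "}"]
    return [f"if [[ -e {src} ]]; then", f"  cp -r {src} {dest}", "fi"]

print("\n".join(guarded_copy_lines(
    "/tmp/tritonbuild/identity/install/backends/identity", "ci/backends")))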
- cmake_script.cpdir(os.path.join(repo_dir, 'qa'), ci_dir) - cmake_script.cpdir(os.path.join(repo_dir, 'deploy'), ci_dir) - cmake_script.mkdir(os.path.join(ci_dir, 'docs')) - cmake_script.cpdir(os.path.join(repo_dir, 'docs', 'examples'), - os.path.join(ci_dir, 'docs')) - cmake_script.mkdir(os.path.join(ci_dir, 'src', 'test')) - cmake_script.cpdir(os.path.join(repo_dir, 'src', 'test', 'models'), - os.path.join(ci_dir, 'src', 'test')) - cmake_script.cpdir(os.path.join(repo_install_dir, 'bin'), ci_dir) - cmake_script.mkdir(os.path.join(ci_dir, 'lib')) - cmake_script.cp( - os.path.join(repo_install_dir, 'lib', - 'libtritonrepoagent_relocation.so'), - os.path.join(ci_dir, 'lib')) + cmake_script.cpdir(os.path.join(repo_dir, "qa"), ci_dir) + cmake_script.cpdir(os.path.join(repo_dir, "deploy"), ci_dir) + cmake_script.mkdir(os.path.join(ci_dir, "docs")) + cmake_script.cpdir( + os.path.join(repo_dir, "docs", "examples"), os.path.join(ci_dir, "docs") + ) + cmake_script.mkdir(os.path.join(ci_dir, "src", "test")) + cmake_script.cpdir( + os.path.join(repo_dir, "src", "test", "models"), + os.path.join(ci_dir, "src", "test"), + ) + # Skip copying the artifacts in the bin, lib, and python as those directories will + # be missing when the core build is not enabled. + if not FLAGS.no_core_build: + cmake_script.cpdir(os.path.join(repo_install_dir, "bin"), ci_dir) + cmake_script.mkdir(os.path.join(ci_dir, "lib")) + cmake_script.cp( + os.path.join(repo_install_dir, "lib", "libtritonrepoagent_relocation.so"), + os.path.join(ci_dir, "lib"), + ) + cmake_script.cpdir(os.path.join(repo_install_dir, "python"), ci_dir) # Some of the backends are needed for CI testing - cmake_script.mkdir(os.path.join(ci_dir, 'backends')) - for be in ('identity', 'repeat', 'square'): - be_install_dir = os.path.join(build_dir, be, 'install', 'backends', be) - if target_platform() == 'windows': - cmake_script.cmd(f'if (Test-Path -Path {be_install_dir}) {{') + cmake_script.mkdir(os.path.join(ci_dir, "backends")) + for be in ("identity", "repeat", "square"): + be_install_dir = os.path.join(build_dir, be, "install", "backends", be) + if target_platform() == "windows": + cmake_script.cmd(f"if (Test-Path -Path {be_install_dir}) {{") else: - cmake_script.cmd(f'if [[ -e {be_install_dir} ]]; then') - cmake_script.cpdir(be_install_dir, os.path.join(ci_dir, 'backends')) - cmake_script.cmd('}' if target_platform() == 'windows' else 'fi') + cmake_script.cmd(f"if [[ -e {be_install_dir} ]]; then") + cmake_script.cpdir(be_install_dir, os.path.join(ci_dir, "backends")) + cmake_script.cmd("}" if target_platform() == "windows" else "fi") # Some of the unit-test built backends are needed for CI testing - cmake_script.mkdir( - os.path.join(ci_dir, 'tritonbuild', 'tritonserver', 'backends')) - for be in ('query', 'implicit_state', 'sequence', 'dyna_sequence', - 'distributed_addsub'): - be_install_dir = os.path.join(repo_install_dir, 'backends', be) - if target_platform() == 'windows': - cmake_script.cmd(f'if (Test-Path -Path {be_install_dir}) {{') + cmake_script.mkdir(os.path.join(ci_dir, "tritonbuild", "tritonserver", "backends")) + for be in ( + "query", + "implicit_state", + "sequence", + "dyna_sequence", + "distributed_addsub", + "iterative_sequence", + ): + be_install_dir = os.path.join(repo_install_dir, "backends", be) + if target_platform() == "windows": + cmake_script.cmd(f"if (Test-Path -Path {be_install_dir}) {{") else: - cmake_script.cmd(f'if [[ -e {be_install_dir} ]]; then') + cmake_script.cmd(f"if [[ -e {be_install_dir} ]]; then") 
cmake_script.cpdir( be_install_dir, - os.path.join(ci_dir, 'tritonbuild', 'tritonserver', 'backends')) - cmake_script.cmd('}' if target_platform() == 'windows' else 'fi') + os.path.join(ci_dir, "tritonbuild", "tritonserver", "backends"), + ) + cmake_script.cmd("}" if target_platform() == "windows" else "fi") # The onnxruntime_backend build produces some artifacts that # are needed for CI testing. - if 'onnxruntime' in backends: - ort_install_dir = os.path.join(build_dir, 'onnxruntime', 'install') - cmake_script.mkdir(os.path.join(ci_dir, 'qa', 'L0_custom_ops')) - cmake_script.cp( - os.path.join(ort_install_dir, 'test', 'libcustom_op_library.so'), - os.path.join(ci_dir, 'qa', 'L0_custom_ops')) - cmake_script.cp( - os.path.join(ort_install_dir, 'test', 'custom_op_test.onnx'), - os.path.join(ci_dir, 'qa', 'L0_custom_ops')) + if "onnxruntime" in backends: + ort_install_dir = os.path.join(build_dir, "onnxruntime", "install") + cmake_script.mkdir(os.path.join(ci_dir, "qa", "L0_custom_ops")) + if target_platform() != "igpu": + cmake_script.cp( + os.path.join(ort_install_dir, "test", "libcustom_op_library.so"), + os.path.join(ci_dir, "qa", "L0_custom_ops"), + ) + cmake_script.cp( + os.path.join(ort_install_dir, "test", "custom_op_test.onnx"), + os.path.join(ci_dir, "qa", "L0_custom_ops"), + ) + # [WIP] other way than wildcard? + backend_tests = os.path.join(build_dir, "onnxruntime", "test", "*") + cmake_script.cpdir(backend_tests, os.path.join(ci_dir, "qa")) # Need the build area for some backends so that they can be # rebuilt with specific options. - cmake_script.mkdir(os.path.join(ci_dir, 'tritonbuild')) - for be in ('identity', 'python'): + cmake_script.mkdir(os.path.join(ci_dir, "tritonbuild")) + for be in ("identity", "python"): if be in backends: - cmake_script.rmdir(os.path.join(build_dir, be, 'build')) - cmake_script.rmdir(os.path.join(build_dir, be, 'install')) - cmake_script.cpdir(os.path.join(build_dir, be), - os.path.join(ci_dir, 'tritonbuild')) + cmake_script.rmdir(os.path.join(build_dir, be, "build")) + cmake_script.rmdir(os.path.join(build_dir, be, "install")) + cmake_script.cpdir( + os.path.join(build_dir, be), os.path.join(ci_dir, "tritonbuild") + ) cmake_script.comment() - cmake_script.comment('end Triton CI artifacts') + cmake_script.comment("end Triton CI artifacts") cmake_script.commentln(8) cmake_script.blankln() def finalize_build(cmake_script, install_dir, ci_dir): - cmake_script.cmd(f'chmod -R a+rw {install_dir}') - cmake_script.cmd(f'chmod -R a+rw {ci_dir}') + cmake_script.cmd(f"chmod -R a+rw {install_dir}") + cmake_script.cmd(f"chmod -R a+rw {ci_dir}") def enable_all(): - if target_platform() != 'windows': + if target_platform() != "windows": all_backends = [ - 'ensemble', 'identity', 'square', 'repeat', 'tensorflow1', - 'tensorflow2', 'onnxruntime', 'python', 'dali', 'pytorch', - 'openvino', 'fil', 'tensorrt' + "ensemble", + "identity", + "square", + "repeat", + "tensorflow", + "onnxruntime", + "python", + "dali", + "pytorch", + "openvino", + "fil", + "tensorrt", ] - all_repoagents = ['checksum'] - all_filesystems = ['gcs', 's3', 'azure_storage'] - all_endpoints = ['http', 'grpc', 'sagemaker', 'vertex-ai'] + all_repoagents = ["checksum"] + all_caches = ["local", "redis"] + all_filesystems = ["gcs", "s3", "azure_storage"] + all_endpoints = ["http", "grpc", "sagemaker", "vertex-ai"] FLAGS.enable_logging = True FLAGS.enable_stats = True FLAGS.enable_metrics = True FLAGS.enable_gpu_metrics = True + FLAGS.enable_cpu_metrics = True FLAGS.enable_tracing = True 
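# The --backend, --repoagent and --cache flags handled below take
# "name[:tag]" items; enable_all() compares only the name part before topping
# up the defaults. A minimal illustration of that split-and-merge:
requested = ["onnxruntime:r23.10", "python"]
names = [item.split(":")[0] for item in requested]
for default in ("ensemble", "identity", "python"):
    if default not in names:
        requested.append(default)
print(requested)  # ['onnxruntime:r23.10', 'python', 'ensemble', 'identity']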
FLAGS.enable_nvtx = True FLAGS.enable_gpu = True else: all_backends = [ - 'ensemble', 'identity', 'square', 'repeat', 'onnxruntime', - 'openvino', 'tensorrt' + "ensemble", + "identity", + "square", + "repeat", + "onnxruntime", + "openvino", + "tensorrt", ] - all_repoagents = ['checksum'] + all_repoagents = ["checksum"] + all_caches = ["local", "redis"] all_filesystems = [] - all_endpoints = ['http', 'grpc'] + all_endpoints = ["http", "grpc"] FLAGS.enable_logging = True FLAGS.enable_stats = True @@ -1719,7 +2050,7 @@ def enable_all(): requested_backends = [] for be in FLAGS.backend: - parts = be.split(':') + parts = be.split(":") requested_backends += [parts[0]] for be in all_backends: if be not in requested_backends: @@ -1727,12 +2058,20 @@ def enable_all(): requested_repoagents = [] for ra in FLAGS.repoagent: - parts = ra.split(':') + parts = ra.split(":") requested_repoagents += [parts[0]] for ra in all_repoagents: if ra not in requested_repoagents: FLAGS.repoagent += [ra] + requested_caches = [] + for cache in FLAGS.cache: + parts = cache.split(":") + requested_caches += [parts[0]] + for cache in all_caches: + if cache not in requested_caches: + FLAGS.cache += [cache] + for fs in all_filesystems: if fs not in FLAGS.filesystem: FLAGS.filesystem += [fs] @@ -1742,294 +2081,296 @@ def enable_all(): FLAGS.endpoint += [ep] -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() group_qv = parser.add_mutually_exclusive_group() - group_qv.add_argument('-q', - '--quiet', - action="store_true", - required=False, - help='Disable console output.') - group_qv.add_argument('-v', - '--verbose', - action="store_true", - required=False, - help='Enable verbose output.') + group_qv.add_argument( + "-q", + "--quiet", + action="store_true", + required=False, + help="Disable console output.", + ) + group_qv.add_argument( + "-v", + "--verbose", + action="store_true", + required=False, + help="Enable verbose output.", + ) parser.add_argument( - '--dryrun', + "--dryrun", + action="store_true", + required=False, + help="Output the build scripts, but do not perform build.", + ) + parser.add_argument( + "--no-container-build", action="store_true", required=False, - help='Output the build scripts, but do not perform build.') - parser.add_argument('--no-container-build', - action="store_true", - required=False, - help='Do not use Docker container for build.') + help="Do not use Docker container for build.", + ) parser.add_argument( - '--no-container-interactive', + "--no-container-interactive", action="store_true", required=False, - help= - 'Do not use -it argument to "docker run" when performing container build.' + help='Do not use -it argument to "docker run" when performing container build.', ) parser.add_argument( - '--no-container-pull', + "--no-container-pull", action="store_true", required=False, - help='Do not use Docker --pull argument when building container.') + help="Do not use Docker --pull argument when building container.", + ) parser.add_argument( - '--container-memory', + "--container-memory", default=None, required=False, - help='Value for Docker --memory argument. Used only for windows builds.' + help="Value for Docker --memory argument. Used only for windows builds.", ) parser.add_argument( - '--target-platform', + "--target-platform", required=False, default=None, - help= - 'Target platform for build, can be "linux", "windows" or "jetpack". If not specified, build targets the current platform.' 
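As a hedged sketch of the merge logic in `enable_all()` above (the helper name and example values are illustrative, not from build.py): anything already requested on the command line is kept, and only the missing defaults are appended.

```
def merge_defaults(requested, defaults):
    # "name" or "name:tag" entries already requested take precedence;
    # a default is appended only when its name is not present yet.
    requested_names = [item.split(":")[0] for item in requested]
    return list(requested) + [d for d in defaults if d not in requested_names]

# e.g. merge_defaults(["redis:r23.06"], ["local", "redis"])
#      -> ["redis:r23.06", "local"]
```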
+ help='Target platform for build, can be "linux", "windows" or "igpu". If not specified, build targets the current platform.', ) parser.add_argument( - '--target-machine', + "--target-machine", required=False, default=None, - help= - 'Target machine/architecture for build. If not specified, build targets the current machine/architecture.' - ) - - parser.add_argument('--build-id', - type=str, - required=False, - help='Build ID associated with the build.') - parser.add_argument('--build-sha', - type=str, - required=False, - help='SHA associated with the build.') + help="Target machine/architecture for build. If not specified, build targets the current machine/architecture.", + ) + + parser.add_argument( + "--build-id", + type=str, + required=False, + help="Build ID associated with the build.", + ) + parser.add_argument( + "--build-sha", type=str, required=False, help="SHA associated with the build." + ) parser.add_argument( - '--build-dir', + "--build-dir", type=str, required=False, - help= - 'Build directory. All repo clones and builds will be performed in this directory.' + help="Build directory. All repo clones and builds will be performed in this directory.", ) parser.add_argument( - '--install-dir', + "--install-dir", type=str, required=False, default=None, - help='Install directory, default is /opt/tritonserver.') + help="Install directory, default is /opt/tritonserver.", + ) parser.add_argument( - '--cmake-dir', + "--cmake-dir", type=str, required=False, - help='Directory containing the CMakeLists.txt file for Triton server.') + help="Directory containing the CMakeLists.txt file for Triton server.", + ) parser.add_argument( - '--tmp-dir', + "--tmp-dir", type=str, required=False, - default='/tmp', - help= - 'Temporary directory used for building inside docker. Default is /tmp.') + default="/tmp", + help="Temporary directory used for building inside docker. Default is /tmp.", + ) parser.add_argument( - '--library-paths', - action='append', + "--library-paths", + action="append", required=False, default=None, - help= - 'Specify library paths for respective backends in build as [:].' + help="Specify library paths for respective backends in build as [:].", ) parser.add_argument( - '--build-type', + "--build-type", required=False, - default='Release', - help= - 'Build type, one of "Release", "Debug", "RelWithDebInfo" or "MinSizeRel". Default is "Release".' + default="Release", + help='Build type, one of "Release", "Debug", "RelWithDebInfo" or "MinSizeRel". Default is "Release".', ) parser.add_argument( - '-j', - '--build-parallel', + "-j", + "--build-parallel", type=int, required=False, default=None, - help='Build parallelism. Defaults to 2 * number-of-cores.') + help="Build parallelism. Defaults to 2 * number-of-cores.", + ) parser.add_argument( - '--github-organization', + "--github-organization", type=str, required=False, - default='https://github.com/triton-inference-server', - help= - 'The GitHub organization containing the repos used for the build. Defaults to "https://github.com/triton-inference-server".' + default="https://github.com/triton-inference-server", + help='The GitHub organization containing the repos used for the build. Defaults to "https://github.com/triton-inference-server".', ) parser.add_argument( - '--version', + "--version", type=str, required=False, - help= - 'The Triton version. If not specified defaults to the value in the TRITON_VERSION file.' + help="The Triton version. 
If not specified defaults to the value in the TRITON_VERSION file.", ) parser.add_argument( - '--container-version', + "--container-version", type=str, required=False, - help= - 'The Triton container version to build. If not specified the container version will be chosen automatically based on --version value.' + help="The Triton container version to build. If not specified the container version will be chosen automatically based on --version value.", ) parser.add_argument( - '--upstream-container-version', + "--upstream-container-version", type=str, required=False, - help= - 'The upstream container version to use for the build. If not specified the upstream container version will be chosen automatically based on --version value.' + help="The upstream container version to use for the build. If not specified the upstream container version will be chosen automatically based on --version value.", ) parser.add_argument( - '--container-prebuild-command', + "--container-prebuild-command", type=str, required=False, - help= - 'When performing a container build, this command will be executed within the container just before the build it performed.' + help="When performing a container build, this command will be executed within the container just before the build it performed.", ) parser.add_argument( - '--no-container-source', + "--no-container-source", action="store_true", required=False, - help='Do not include OSS source code in Docker container.') + help="Do not include OSS source code in Docker container.", + ) parser.add_argument( - '--image', - action='append', + "--image", + action="append", required=False, - help= - 'Use specified Docker image in build as ,. can be "base", "gpu-base", "tensorflow1", "tensorflow2", or "pytorch".' + help='Use specified Docker image in build as ,. can be "base", "gpu-base", "tensorflow", or "pytorch".', ) parser.add_argument( - '--enable-all', + "--enable-all", + action="store_true", + required=False, + help="Enable all standard released Triton features, backends, repository agents, caches, endpoints and file systems.", + ) + parser.add_argument( + "--enable-logging", action="store_true", required=False, help="Enable logging." + ) + parser.add_argument( + "--enable-stats", action="store_true", required=False, - help= - 'Enable all standard released Triton features, backends, repository agents, endpoints and file systems.' 
- ) - parser.add_argument('--enable-logging', - action="store_true", - required=False, - help='Enable logging.') - parser.add_argument('--enable-stats', - action="store_true", - required=False, - help='Enable statistics collection.') - parser.add_argument('--enable-metrics', - action="store_true", - required=False, - help='Enable metrics reporting.') - parser.add_argument('--enable-gpu-metrics', - action="store_true", - required=False, - help='Include GPU metrics in reported metrics.') - parser.add_argument('--enable-tracing', - action="store_true", - required=False, - help='Enable tracing.') - parser.add_argument('--enable-nvtx', - action="store_true", - required=False, - help='Enable NVTX.') - parser.add_argument('--enable-gpu', - action="store_true", - required=False, - help='Enable GPU support.') - parser.add_argument('--enable-mali-gpu', - action="store_true", - required=False, - help='Enable ARM MALI GPU support.') + help="Enable statistics collection.", + ) parser.add_argument( - '--min-compute-capability', + "--enable-metrics", + action="store_true", + required=False, + help="Enable metrics reporting.", + ) + parser.add_argument( + "--enable-gpu-metrics", + action="store_true", + required=False, + help="Include GPU metrics in reported metrics.", + ) + parser.add_argument( + "--enable-cpu-metrics", + action="store_true", + required=False, + help="Include CPU metrics in reported metrics.", + ) + parser.add_argument( + "--enable-tracing", action="store_true", required=False, help="Enable tracing." + ) + parser.add_argument( + "--enable-nvtx", action="store_true", required=False, help="Enable NVTX." + ) + parser.add_argument( + "--enable-gpu", action="store_true", required=False, help="Enable GPU support." + ) + parser.add_argument( + "--enable-mali-gpu", + action="store_true", + required=False, + help="Enable ARM MALI GPU support.", + ) + parser.add_argument( + "--min-compute-capability", type=str, required=False, - default='6.0', - help='Minimum CUDA compute capability supported by server.') + default="6.0", + help="Minimum CUDA compute capability supported by server.", + ) parser.add_argument( - '--endpoint', - action='append', + "--endpoint", + action="append", required=False, - help= - 'Include specified endpoint in build. Allowed values are "grpc", "http", "vertex-ai" and "sagemaker".' + help='Include specified endpoint in build. Allowed values are "grpc", "http", "vertex-ai" and "sagemaker".', ) parser.add_argument( - '--filesystem', - action='append', + "--filesystem", + action="append", required=False, - help= - 'Include specified filesystem in build. Allowed values are "gcs", "azure_storage" and "s3".' + help='Include specified filesystem in build. Allowed values are "gcs", "azure_storage" and "s3".', ) parser.add_argument( - '--no-core-build', + "--no-core-build", action="store_true", required=False, - help='Do not build Triton core sharead library or executable.') + help="Do not build Triton core shared library or executable.", + ) parser.add_argument( - '--backend', - action='append', + "--backend", + action="append", required=False, - help= - 'Include specified backend in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version 22.05 -> branch r22.05); otherwise the default is "main" (e.g. version 22.05dev -> branch main).' 
+ help='Include specified backend in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version YY.MM -> branch rYY.MM); otherwise the default is "main" (e.g. version YY.MMdev -> branch main).', ) parser.add_argument( - '--build-multiple-openvino', - action="store_true", - default=False, - help= - 'Build multiple openVINO versions as specified in TRITON_VERSION_MAP. Be aware that loading backends with different openvino versions simultaneously in triton can cause conflicts' + "--repo-tag", + action="append", + required=False, + help='The version of a component to use in the build as :. can be "common", "core", "backend" or "thirdparty". indicates the git tag/branch to use for the build. Currently does not support pull-request reference. If the version is non-development then the default is the release branch matching the container version (e.g. version YY.MM -> branch rYY.MM); otherwise the default is "main" (e.g. version YY.MMdev -> branch main).', ) parser.add_argument( - '--repo-tag', - action='append', + "--repoagent", + action="append", required=False, - help= - 'The version of a component to use in the build as :. can be "common", "core", "backend" or "thirdparty". If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch. If the version is non-development then the default is the release branch matching the container version (e.g. version 22.05 -> branch r22.05); otherwise the default is "main" (e.g. version 22.05dev -> branch main).' + help='Include specified repo agent in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version YY.MM -> branch rYY.MM); otherwise the default is "main" (e.g. version YY.MMdev -> branch main).', ) parser.add_argument( - '--repoagent', - action='append', + "--cache", + action="append", required=False, - help= - 'Include specified repo agent in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version 22.05 -> branch r22.05); otherwise the default is "main" (e.g. version 22.05dev -> branch main).' + help='Include specified cache in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version YY.MM -> branch rYY.MM); otherwise the default is "main" (e.g. version YY.MMdev -> branch main).', ) parser.add_argument( - '--no-force-clone', + "--no-force-clone", action="store_true", default=False, - help='Do not create fresh clones of repos that have already been cloned.' + help="Do not create fresh clones of repos that have already been cloned.", ) parser.add_argument( - '--extra-core-cmake-arg', - action='append', + "--extra-core-cmake-arg", + action="append", required=False, - help= - 'Extra CMake argument as =. 
The argument is passed to CMake as -D= and is included after all CMake arguments added by build.py for the core builds.' + help="Extra CMake argument as =. The argument is passed to CMake as -D= and is included after all CMake arguments added by build.py for the core builds.", ) parser.add_argument( - '--override-core-cmake-arg', - action='append', + "--override-core-cmake-arg", + action="append", required=False, - help= - 'Override specified CMake argument in the build as =. The argument is passed to CMake as -D=. This flag only impacts CMake arguments that are used by build.py. To unconditionally add a CMake argument to the core build use --extra-core-cmake-arg.' + help="Override specified CMake argument in the build as =. The argument is passed to CMake as -D=. This flag only impacts CMake arguments that are used by build.py. To unconditionally add a CMake argument to the core build use --extra-core-cmake-arg.", ) parser.add_argument( - '--extra-backend-cmake-arg', - action='append', + "--extra-backend-cmake-arg", + action="append", required=False, - help= - 'Extra CMake argument for a backend build as :=. The argument is passed to CMake as -D= and is included after all CMake arguments added by build.py for the backend.' + help="Extra CMake argument for a backend build as :=. The argument is passed to CMake as -D= and is included after all CMake arguments added by build.py for the backend.", ) parser.add_argument( - '--override-backend-cmake-arg', - action='append', + "--override-backend-cmake-arg", + action="append", required=False, - help= - 'Override specified backend CMake argument in the build as :=. The argument is passed to CMake as -D=. This flag only impacts CMake arguments that are used by build.py. To unconditionally add a CMake argument to the backend build use --extra-backend-cmake-arg.' + help="Override specified backend CMake argument in the build as :=. The argument is passed to CMake as -D=. This flag only impacts CMake arguments that are used by build.py. To unconditionally add a CMake argument to the backend build use --extra-backend-cmake-arg.", ) FLAGS = parser.parse_args() @@ -2046,6 +2387,8 @@ def enable_all(): FLAGS.filesystem = [] if FLAGS.repoagent is None: FLAGS.repoagent = [] + if FLAGS.cache is None: + FLAGS.cache = [] if FLAGS.library_paths is None: FLAGS.library_paths = [] if FLAGS.extra_core_cmake_arg is None: @@ -2058,7 +2401,7 @@ def enable_all(): FLAGS.extra_backend_cmake_arg = [] # if --enable-all is specified, then update FLAGS to enable all - # settings, backends, repo-agents, file systems, endpoints, etc. + # settings, backends, repo-agents, caches, file systems, endpoints, etc. if FLAGS.enable_all: enable_all() @@ -2069,64 +2412,63 @@ def enable_all(): # set. 
if FLAGS.no_container_build: if FLAGS.build_dir is None: - fail('--no-container-build requires --build-dir') + fail("--no-container-build requires --build-dir") if FLAGS.install_dir is None: - FLAGS.install_dir = os.path.join(FLAGS.build_dir, "opt", - "tritonserver") + FLAGS.install_dir = os.path.join(FLAGS.build_dir, "opt", "tritonserver") if FLAGS.cmake_dir is None: FLAGS.cmake_dir = THIS_SCRIPT_DIR else: if FLAGS.build_dir is not None: - fail('--build-dir must not be set for container-based build') + fail("--build-dir must not be set for container-based build") if FLAGS.install_dir is not None: - fail('--install-dir must not be set for container-based build') + fail("--install-dir must not be set for container-based build") if FLAGS.cmake_dir is not None: - fail('--cmake-dir must not be set for container-based build') - FLAGS.build_dir = os.path.join(THIS_SCRIPT_DIR, 'build') + fail("--cmake-dir must not be set for container-based build") + FLAGS.build_dir = os.path.join(THIS_SCRIPT_DIR, "build") # Determine the versions. Start with Triton version, if --version # is not explicitly specified read from TRITON_VERSION file. if FLAGS.version is None: - with open(os.path.join(THIS_SCRIPT_DIR, 'TRITON_VERSION'), - "r") as vfile: + with open(os.path.join(THIS_SCRIPT_DIR, "TRITON_VERSION"), "r") as vfile: FLAGS.version = vfile.readline().strip() if FLAGS.build_parallel is None: FLAGS.build_parallel = multiprocessing.cpu_count() * 2 - log('Building Triton Inference Server') - log('platform {}'.format(target_platform())) - log('machine {}'.format(target_machine())) - log('version {}'.format(FLAGS.version)) - log('build dir {}'.format(FLAGS.build_dir)) - log('install dir {}'.format(FLAGS.install_dir)) - log('cmake dir {}'.format(FLAGS.cmake_dir)) + log("Building Triton Inference Server") + log("platform {}".format(target_platform())) + log("machine {}".format(target_machine())) + log("version {}".format(FLAGS.version)) + log("build dir {}".format(FLAGS.build_dir)) + log("install dir {}".format(FLAGS.install_dir)) + log("cmake dir {}".format(FLAGS.cmake_dir)) # Determine the default repo-tag that should be used for images, - # backends and repo-agents if a repo-tag is not given + # backends, repo-agents, and caches if a repo-tag is not given # explicitly. For release branches we use the release branch as # the default, otherwise we use 'main'. - default_repo_tag = 'main' + default_repo_tag = "main" cver = FLAGS.container_version if cver is None: if FLAGS.version not in TRITON_VERSION_MAP: fail( - 'unable to determine default repo-tag, container version not known for {}' - .format(FLAGS.version)) + "unable to determine default repo-tag, container version not known for {}".format( + FLAGS.version + ) + ) cver = TRITON_VERSION_MAP[FLAGS.version][0] - if not cver.endswith('dev'): - default_repo_tag = 'r' + cver - log('default repo-tag: {}'.format(default_repo_tag)) + if not cver.endswith("dev"): + default_repo_tag = "r" + cver + log("default repo-tag: {}".format(default_repo_tag)) # For other versions use the TRITON_VERSION_MAP unless explicitly # given. 
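The default repo-tag rule applied above reduces to a one-line decision; the helper below is only an illustration of that rule, not a function in build.py.

```
def default_repo_tag_for(container_version):
    # Release containers track their matching r<version> branch;
    # development containers ("...dev") fall back to "main".
    return "main" if container_version.endswith("dev") else "r" + container_version

# e.g. default_repo_tag_for("23.06") -> "r23.06"
#      default_repo_tag_for("23.06dev") -> "main"
```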
FLAGS.container_version, FLAGS.upstream_container_version = container_versions( - FLAGS.version, FLAGS.container_version, - FLAGS.upstream_container_version) + FLAGS.version, FLAGS.container_version, FLAGS.upstream_container_version + ) - log('container version {}'.format(FLAGS.container_version)) - log('upstream container version {}'.format( - FLAGS.upstream_container_version)) + log("container version {}".format(FLAGS.container_version)) + log("upstream container version {}".format(FLAGS.upstream_container_version)) for ep in FLAGS.endpoint: log(f'endpoint "{ep}"') @@ -2136,116 +2478,146 @@ def enable_all(): # Initialize map of backends to build and repo-tag for each. backends = {} for be in FLAGS.backend: - parts = be.split(':') + parts = be.split(":") if len(parts) == 1: parts.append(default_repo_tag) + if parts[0] == "tensorflow1": + fail( + "Starting from Triton version 23.04, support for TensorFlow 1 has been discontinued. Please switch to Tensorflow 2." + ) + if parts[0] == "tensorflow2": + parts[0] = "tensorflow" log('backend "{}" at tag/branch "{}"'.format(parts[0], parts[1])) backends[parts[0]] = parts[1] + if "vllm" in backends: + if "python" not in backends: + log( + "vLLM backend requires Python backend, adding Python backend with tag {}".format( + backends["vllm"] + ) + ) + backends["python"] = backends["vllm"] + # Initialize map of repo agents to build and repo-tag for each. repoagents = {} for be in FLAGS.repoagent: - parts = be.split(':') + parts = be.split(":") if len(parts) == 1: parts.append(default_repo_tag) log('repoagent "{}" at tag/branch "{}"'.format(parts[0], parts[1])) repoagents[parts[0]] = parts[1] + # Initialize map of caches to build and repo-tag for each. + caches = {} + for be in FLAGS.cache: + parts = be.split(":") + if len(parts) == 1: + parts.append(default_repo_tag) + log('cache "{}" at tag/branch "{}"'.format(parts[0], parts[1])) + caches[parts[0]] = parts[1] + # Initialize map of docker images. images = {} for img in FLAGS.image: - parts = img.split(',') + parts = img.split(",") fail_if( - len(parts) != 2, - '--image must specify ,') + len(parts) != 2, "--image must specify ," + ) fail_if( - parts[0] not in [ - 'base', 'gpu-base', 'pytorch', 'tensorflow1', 'tensorflow2' - ], 'unsupported value for --image') + parts[0] + not in ["base", "gpu-base", "pytorch", "tensorflow", "tensorflow2"], + "unsupported value for --image", + ) log('image "{}": "{}"'.format(parts[0], parts[1])) + if parts[0] == "tensorflow2": + parts[0] = "tensorflow" images[parts[0]] = parts[1] # Initialize map of library paths for each backend. 
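A hedged sketch of the `<name>[:<tag>]` parsing used above for `--backend` (and, without the TensorFlow aliasing, for `--repoagent` and `--cache`); the standalone helper is illustrative and not part of build.py.

```
def parse_backend_spec(spec, default_tag):
    # Split "name[:tag]"; a missing tag falls back to the default repo tag.
    name, _, tag = spec.partition(":")
    if name == "tensorflow1":
        # Mirrors the check above: TensorFlow 1 support ended in 23.04.
        raise ValueError("TensorFlow 1 is no longer supported; use tensorflow")
    if name == "tensorflow2":
        # Legacy alias accepted for backward compatibility.
        name = "tensorflow"
    return name, (tag or default_tag)

# e.g. parse_backend_spec("onnxruntime", "r23.06") -> ("onnxruntime", "r23.06")
#      parse_backend_spec("tensorflow2:main", "r23.06") -> ("tensorflow", "main")
```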
library_paths = {} for lpath in FLAGS.library_paths: - parts = lpath.split(':') + parts = lpath.split(":") if len(parts) == 2: log('backend "{}" library path "{}"'.format(parts[0], parts[1])) + if parts[0] == "tensorflow2": + parts[0] = "tensorflow" library_paths[parts[0]] = parts[1] # Parse any explicitly specified cmake arguments for cf in FLAGS.extra_core_cmake_arg: - parts = cf.split('=') - fail_if( - len(parts) != 2, - '--extra-core-cmake-arg must specify =') + parts = cf.split("=") + fail_if(len(parts) != 2, "--extra-core-cmake-arg must specify =") log('CMake core extra "-D{}={}"'.format(parts[0], parts[1])) EXTRA_CORE_CMAKE_FLAGS[parts[0]] = parts[1] for cf in FLAGS.override_core_cmake_arg: - parts = cf.split('=') + parts = cf.split("=") fail_if( - len(parts) != 2, - '--override-core-cmake-arg must specify =') + len(parts) != 2, "--override-core-cmake-arg must specify =" + ) log('CMake core override "-D{}={}"'.format(parts[0], parts[1])) OVERRIDE_CORE_CMAKE_FLAGS[parts[0]] = parts[1] for cf in FLAGS.extra_backend_cmake_arg: - parts = cf.split(':', 1) + parts = cf.split(":", 1) fail_if( len(parts) != 2, - '--extra-backend-cmake-arg must specify :=') + "--extra-backend-cmake-arg must specify :=", + ) be = parts[0] - parts = parts[1].split('=', 1) + parts = parts[1].split("=", 1) fail_if( len(parts) != 2, - '--extra-backend-cmake-arg must specify :=') + "--extra-backend-cmake-arg must specify :=", + ) fail_if( be not in backends, - '--extra-backend-cmake-arg specifies backend "{}" which is not included in build' - .format(be)) + '--extra-backend-cmake-arg specifies backend "{}" which is not included in build'.format( + be + ), + ) log('backend "{}" CMake extra "-D{}={}"'.format(be, parts[0], parts[1])) if be not in EXTRA_BACKEND_CMAKE_FLAGS: EXTRA_BACKEND_CMAKE_FLAGS[be] = {} EXTRA_BACKEND_CMAKE_FLAGS[be][parts[0]] = parts[1] for cf in FLAGS.override_backend_cmake_arg: - parts = cf.split(':', 1) + parts = cf.split(":", 1) fail_if( len(parts) != 2, - '--override-backend-cmake-arg must specify :=' + "--override-backend-cmake-arg must specify :=", ) be = parts[0] - parts = parts[1].split('=', 1) + parts = parts[1].split("=", 1) fail_if( len(parts) != 2, - '--override-backend-cmake-arg must specify :=' + "--override-backend-cmake-arg must specify :=", ) fail_if( be not in backends, - '--override-backend-cmake-arg specifies backend "{}" which is not included in build' - .format(be)) - log('backend "{}" CMake override "-D{}={}"'.format( - be, parts[0], parts[1])) + '--override-backend-cmake-arg specifies backend "{}" which is not included in build'.format( + be + ), + ) + log('backend "{}" CMake override "-D{}={}"'.format(be, parts[0], parts[1])) if be not in OVERRIDE_BACKEND_CMAKE_FLAGS: OVERRIDE_BACKEND_CMAKE_FLAGS[be] = {} OVERRIDE_BACKEND_CMAKE_FLAGS[be][parts[0]] = parts[1] # Initialize map of common components and repo-tag for each. 
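As an illustrative aside on the `<backend>:<name>=<value>` format accepted above by `--extra-backend-cmake-arg` and `--override-backend-cmake-arg` (the helper and the example values are assumptions, not code from build.py):

```
def parse_backend_cmake_arg(arg):
    # Split off the backend name first, then the CMake name=value pair;
    # both separators are required, mirroring the fail_if checks above.
    backend, sep, rest = arg.partition(":")
    if not sep:
        raise ValueError("expected <backend>:<name>=<value>")
    name, sep, value = rest.partition("=")
    if not sep:
        raise ValueError("expected <backend>:<name>=<value>")
    return backend, name, value

# e.g. parse_backend_cmake_arg("python:CMAKE_BUILD_TYPE=Debug")
#      -> ("python", "CMAKE_BUILD_TYPE", "Debug")
```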
components = { - 'common': default_repo_tag, - 'core': default_repo_tag, - 'backend': default_repo_tag, - 'thirdparty': default_repo_tag + "common": default_repo_tag, + "core": default_repo_tag, + "backend": default_repo_tag, + "thirdparty": default_repo_tag, } for be in FLAGS.repo_tag: - parts = be.split(':') - fail_if( - len(parts) != 2, - '--repo-tag must specify :') + parts = be.split(":") + fail_if(len(parts) != 2, "--repo-tag must specify :") fail_if( parts[0] not in components, - '--repo-tag must be "common", "core", "backend", or "thirdparty"' + '--repo-tag must be "common", "core", "backend", or "thirdparty"', ) components[parts[0]] = parts[1] for c in components: @@ -2264,94 +2636,119 @@ def enable_all(): # FLAGS.tmp_dir may be specified with "\" on Windows, adjust # to "/" for docker usage. script_build_dir = os.path.normpath( - os.path.join(FLAGS.tmp_dir, 'tritonbuild').replace("\\", "/")) - script_install_dir = os.path.normpath( - os.path.join(script_build_dir, 'install')) - script_ci_dir = os.path.normpath(os.path.join(script_build_dir, 'ci')) - if target_platform() == 'windows': - script_repo_dir = script_cmake_dir = os.path.normpath( - 'c:/workspace') + os.path.join(FLAGS.tmp_dir, "tritonbuild").replace("\\", "/") + ) + script_install_dir = os.path.normpath(os.path.join(script_build_dir, "install")) + script_ci_dir = os.path.normpath(os.path.join(script_build_dir, "ci")) + if target_platform() == "windows": + script_repo_dir = script_cmake_dir = os.path.normpath("c:/workspace") else: - script_repo_dir = script_cmake_dir = '/workspace' + script_repo_dir = script_cmake_dir = "/workspace" - script_name = 'cmake_build' - if target_platform() == 'windows': - script_name += '.ps1' + script_name = "cmake_build" + if target_platform() == "windows": + script_name += ".ps1" - # Write the build script that invokes cmake for the core, backends, and repo-agents. + # Write the build script that invokes cmake for the core, backends, repo-agents, and caches. pathlib.Path(FLAGS.build_dir).mkdir(parents=True, exist_ok=True) with BuildScript( - os.path.join(FLAGS.build_dir, script_name), - verbose=FLAGS.verbose, - desc=('Build script for Triton Inference Server')) as cmake_script: - + os.path.join(FLAGS.build_dir, script_name), + verbose=FLAGS.verbose, + desc=("Build script for Triton Inference Server"), + ) as cmake_script: # Run the container pre-build command if the cmake build is # being done within the build container. if not FLAGS.no_container_build and FLAGS.container_prebuild_command: - cmake_script.cmd(FLAGS.container_prebuild_command, - check_exitcode=True) + cmake_script.cmd(FLAGS.container_prebuild_command, check_exitcode=True) cmake_script.blankln() # Commands to build the core shared library and the server executable. if not FLAGS.no_core_build: - core_build(cmake_script, script_repo_dir, script_cmake_dir, - script_build_dir, script_install_dir, components, - backends) + core_build( + cmake_script, + script_repo_dir, + script_cmake_dir, + script_build_dir, + script_install_dir, + components, + backends, + ) # Commands to build each backend... for be in backends: # Core backends are not built separately from core so skip... 
- if (be in CORE_BACKENDS): + if be in CORE_BACKENDS: continue - tagged_be_list = [] - if (be == 'openvino'): - tagged_be_list.append( - tagged_backend(be, TRITON_VERSION_MAP[FLAGS.version][4][0])) - if (FLAGS.build_multiple_openvino): - skip = True - for ver in TRITON_VERSION_MAP[FLAGS.version][4]: - if not skip: - tagged_be_list.append(tagged_backend(be, ver)) - skip = False - # If armnn_tflite backend, source from external repo for git clone - if be == 'armnn_tflite': - github_organization = 'https://gitlab.com/arm-research/smarter/' + if be == "armnn_tflite": + github_organization = "https://gitlab.com/arm-research/smarter/" else: github_organization = FLAGS.github_organization - if not tagged_be_list: - backend_build(be, cmake_script, backends[be], script_build_dir, - script_install_dir, github_organization, images, - components, library_paths) + if be == "vllm": + backend_clone( + be, + cmake_script, + backends[be], + script_build_dir, + script_install_dir, + github_organization, + ) else: - variant_index = 0 - for tagged_be in tagged_be_list: - backend_build(tagged_be, cmake_script, backends[be], - script_build_dir, script_install_dir, - github_organization, images, components, - library_paths, variant_index) - variant_index += 1 + backend_build( + be, + cmake_script, + backends[be], + script_build_dir, + script_install_dir, + github_organization, + images, + components, + library_paths, + ) # Commands to build each repo agent... for ra in repoagents: - repo_agent_build(ra, cmake_script, script_build_dir, - script_install_dir, repoagent_repo, repoagents) + repo_agent_build( + ra, + cmake_script, + script_build_dir, + script_install_dir, + repoagent_repo, + repoagents, + ) + + # Commands to build each cache... + for cache in caches: + cache_build( + cache, + cmake_script, + script_build_dir, + script_install_dir, + cache_repo, + caches, + ) # Commands needed only when building with Docker... if not FLAGS.no_container_build: # Commands to collect all the build artifacts needed for CI # testing. - cibase_build(cmake_script, script_repo_dir, script_cmake_dir, - script_build_dir, script_install_dir, script_ci_dir, - backends) + cibase_build( + cmake_script, + script_repo_dir, + script_cmake_dir, + script_build_dir, + script_install_dir, + script_ci_dir, + backends, + ) # When building with Docker the install and ci artifacts # written to the build-dir while running the docker container # may have root ownership, so give them permissions to be # managed by all users on the host system. - if target_platform() != 'windows': + if target_platform() != "windows": finalize_build(cmake_script, script_install_dir, script_ci_dir) # If --no-container-build is not specified then we perform the @@ -2360,24 +2757,25 @@ def enable_all(): # generate a few Dockerfiles and a top-level script that drives # the build process. if not FLAGS.no_container_build: - script_name = 'docker_build' - if target_platform() == 'windows': - script_name += '.ps1' + script_name = "docker_build" + if target_platform() == "windows": + script_name += ".ps1" - create_build_dockerfiles(script_build_dir, images, backends, repoagents, - FLAGS.endpoint) - create_docker_build_script(script_name, script_install_dir, - script_ci_dir) + create_build_dockerfiles( + script_build_dir, images, backends, repoagents, caches, FLAGS.endpoint + ) + create_docker_build_script(script_name, script_install_dir, script_ci_dir) # In not dry-run, execute the script to perform the build... 
If a # container-based build is requested use 'docker_build' script, # otherwise build directly on this system using cmake script. if not FLAGS.dryrun: - if target_platform() == 'windows': + if target_platform() == "windows": p = subprocess.Popen( - ['powershell.exe', '-noexit', '-File', f'./{script_name}'], - cwd=FLAGS.build_dir) + ["powershell.exe", "-noexit", "-File", f"./{script_name}"], + cwd=FLAGS.build_dir, + ) else: - p = subprocess.Popen([f'./{script_name}'], cwd=FLAGS.build_dir) + p = subprocess.Popen([f"./{script_name}"], cwd=FLAGS.build_dir) p.wait() - fail_if(p.returncode != 0, 'build failed') + fail_if(p.returncode != 0, "build failed") diff --git a/compose.py b/compose.py old mode 100644 new mode 100755 index 095cac9174..9f948c14fd --- a/compose.py +++ b/compose.py @@ -1,5 +1,5 @@ #!/usr/bin/env python3 -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -39,7 +39,7 @@ def log(msg, force=False): try: print(msg, file=sys.stderr) except Exception: - print('', file=sys.stderr) + print("", file=sys.stderr) def log_verbose(msg): @@ -48,7 +48,7 @@ def log_verbose(msg): def fail(msg): - print('error: {}'.format(msg), file=sys.stderr) + print("error: {}".format(msg), file=sys.stderr) sys.exit(1) @@ -58,8 +58,8 @@ def fail_if(p, msg): def start_dockerfile(ddir, images, argmap, dockerfile_name, backends): - # Set enviroment variables, set default user and install dependencies - df = ''' + # Set environment variables, set default user and install dependencies + df = """ # # Multistage build. # @@ -67,30 +67,38 @@ def start_dockerfile(ddir, images, argmap, dockerfile_name, backends): ARG TRITON_CONTAINER_VERSION={} FROM {} AS full -'''.format(argmap['TRITON_VERSION'], argmap['TRITON_CONTAINER_VERSION'], - images["full"]) +""".format( + argmap["TRITON_VERSION"], argmap["TRITON_CONTAINER_VERSION"], images["full"] + ) # PyTorch, TensorFlow 1 and TensorFlow 2 backends need extra CUDA and other # dependencies during runtime that are missing in the CPU-only base container. # These dependencies must be copied from the Triton Min image. - if not FLAGS.enable_gpu and (('pytorch' in backends) or - ('tensorflow1' in backends) or - ('tensorflow2' in backends)): - df += ''' + if not FLAGS.enable_gpu and ( + ("pytorch" in backends) + or ("tensorflow1" in backends) + or ("tensorflow2" in backends) + ): + df += """ FROM {} AS min_container -'''.format(images["gpu-min"]) +""".format( + images["gpu-min"] + ) - df += ''' + df += """ FROM {} -'''.format(images["min"]) +""".format( + images["min"] + ) import build - df += build.dockerfile_prepare_container_linux(argmap, backends, - FLAGS.enable_gpu, - platform.machine().lower()) + + df += build.dockerfile_prepare_container_linux( + argmap, backends, FLAGS.enable_gpu, platform.machine().lower() + ) # Copy over files - df += ''' + df += """ WORKDIR /opt/tritonserver COPY --chown=1000:1000 --from=full /opt/tritonserver/LICENSE . COPY --chown=1000:1000 --from=full /opt/tritonserver/TRITON_VERSION . 
@@ -98,7 +106,7 @@ def start_dockerfile(ddir, images, argmap, dockerfile_name, backends): COPY --chown=1000:1000 --from=full /opt/tritonserver/bin bin/ COPY --chown=1000:1000 --from=full /opt/tritonserver/lib lib/ COPY --chown=1000:1000 --from=full /opt/tritonserver/include include/ -''' +""" with open(os.path.join(ddir, dockerfile_name), "w") as dfile: dfile.write(df) @@ -106,17 +114,15 @@ def start_dockerfile(ddir, images, argmap, dockerfile_name, backends): def add_requested_backends(ddir, dockerfile_name, backends): df = "# Copying over backends \n" for backend in backends: - if backend == 'openvino': - import build - ver = next(iter(build.TRITON_VERSION_MAP.values())) - backend = build.tagged_backend(backend, ver[4][0]) - df += '''COPY --chown=1000:1000 --from=full /opt/tritonserver/backends/{} /opt/tritonserver/backends/{} -'''.format(backend, backend) + df += """COPY --chown=1000:1000 --from=full /opt/tritonserver/backends/{} /opt/tritonserver/backends/{} +""".format( + backend, backend + ) if len(backends) > 0: - df += ''' + df += """ # Top-level /opt/tritonserver/backends not copied so need to explicitly set permissions here RUN chown triton-server:triton-server /opt/tritonserver/backends -''' +""" with open(os.path.join(ddir, dockerfile_name), "a") as dfile: dfile.write(df) @@ -124,13 +130,31 @@ def add_requested_backends(ddir, dockerfile_name, backends): def add_requested_repoagents(ddir, dockerfile_name, repoagents): df = "# Copying over repoagents \n" for ra in repoagents: - df += '''COPY --chown=1000:1000 --from=full /opt/tritonserver/repoagents/{} /opt/tritonserver/repoagents/{} -'''.format(ra, ra) + df += """COPY --chown=1000:1000 --from=full /opt/tritonserver/repoagents/{} /opt/tritonserver/repoagents/{} +""".format( + ra, ra + ) if len(repoagents) > 0: - df += ''' + df += """ # Top-level /opt/tritonserver/repoagents not copied so need to explicitly set permissions here RUN chown triton-server:triton-server /opt/tritonserver/repoagents -''' +""" + with open(os.path.join(ddir, dockerfile_name), "a") as dfile: + dfile.write(df) + + +def add_requested_caches(ddir, dockerfile_name, caches): + df = "# Copying over caches \n" + for cache in caches: + df += """COPY --chown=1000:1000 --from=full /opt/tritonserver/caches/{} /opt/tritonserver/caches/{} +""".format( + cache, cache + ) + if len(caches) > 0: + df += """ +# Top-level /opt/tritonserver/caches not copied so need to explicitly set permissions here +RUN chown triton-server:triton-server /opt/tritonserver/caches +""" with open(os.path.join(ddir, dockerfile_name), "a") as dfile: dfile.write(df) @@ -138,226 +162,292 @@ def add_requested_repoagents(ddir, dockerfile_name, repoagents): def end_dockerfile(ddir, dockerfile_name, argmap): # Install additional dependencies df = "" - if argmap['SAGEMAKER_ENDPOINT']: - df += ''' + if argmap["SAGEMAKER_ENDPOINT"]: + df += """ LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true COPY --chown=1000:1000 --from=full /usr/bin/serve /usr/bin/. 
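To make the new cache handling concrete, here is a self-contained sketch of the Dockerfile fragment that `add_requested_caches()` above appends; this minimal re-implementation is for demonstration only and returns the fragment instead of writing to Dockerfile.compose.

```
def caches_fragment(caches):
    # Copy each cache implementation out of the "full" image, then fix
    # ownership of the top-level caches directory.
    df = "# Copying over caches \n"
    for cache in caches:
        df += (
            "COPY --chown=1000:1000 --from=full "
            f"/opt/tritonserver/caches/{cache} /opt/tritonserver/caches/{cache}\n"
        )
    if caches:
        df += (
            "\n# Top-level /opt/tritonserver/caches not copied so need to "
            "explicitly set permissions here\n"
            "RUN chown triton-server:triton-server /opt/tritonserver/caches\n"
        )
    return df

print(caches_fragment(["local", "redis"]))
```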
-''' +""" with open(os.path.join(ddir, dockerfile_name), "a") as dfile: dfile.write(df) def build_docker_image(ddir, dockerfile_name, container_name): # Create container with docker build - p = subprocess.Popen(['docker', 'build', '-t', container_name, '-f', \ - os.path.join(ddir, dockerfile_name), '.']) + p = subprocess.Popen( + [ + "docker", + "build", + "-t", + container_name, + "-f", + os.path.join(ddir, dockerfile_name), + ".", + ] + ) p.wait() - fail_if(p.returncode != 0, 'docker build {} failed'.format(container_name)) + fail_if(p.returncode != 0, "docker build {} failed".format(container_name)) def get_container_version_if_not_specified(): if FLAGS.container_version is None: # Read from TRITON_VERSION file in server repo to determine version - with open('TRITON_VERSION', "r") as vfile: + with open("TRITON_VERSION", "r") as vfile: version = vfile.readline().strip() import build + _, FLAGS.container_version = build.container_versions( - version, None, FLAGS.container_version) - log('version {}'.format(version)) - log('using container version {}'.format(FLAGS.container_version)) + version, None, FLAGS.container_version + ) + log("version {}".format(version)) + log("using container version {}".format(FLAGS.container_version)) -def create_argmap(images): +def create_argmap(images, skip_pull): # Extract information from upstream build and create map other functions can # use full_docker_image = images["full"] min_docker_image = images["min"] enable_gpu = FLAGS.enable_gpu - # Docker inspect enviroment variables - base_run_args = ['docker', 'inspect', '-f'] - import re # parse all PATH enviroment variables + # Docker inspect environment variables + base_run_args = ["docker", "inspect", "-f"] + import re # parse all PATH environment variables # first pull docker images - log("pulling container:{}".format(full_docker_image)) - p = subprocess.run(['docker', 'pull', full_docker_image]) - fail_if( - p.returncode != 0, - 'docker pull container {} failed, {}'.format(full_docker_image, - p.stderr)) - if enable_gpu: - pm = subprocess.run(['docker', 'pull', min_docker_image]) + if not skip_pull: + log("pulling container:{}".format(full_docker_image)) + p = subprocess.run(["docker", "pull", full_docker_image]) fail_if( - pm.returncode != 0, 'docker pull container {} failed, {}'.format( - min_docker_image, pm.stderr)) - pm_path = subprocess.run(base_run_args + [ - '{{range $index, $value := .Config.Env}}{{$value}} {{end}}', - min_docker_image - ], - capture_output=True, - text=True) + p.returncode != 0, + "docker pull container {} failed, {}".format(full_docker_image, p.stderr), + ) + if enable_gpu: + if not skip_pull: + pm = subprocess.run(["docker", "pull", min_docker_image]) + fail_if( + pm.returncode != 0 and not skip_pull, + "docker pull container {} failed, {}".format( + min_docker_image, pm.stderr + ), + ) + pm_path = subprocess.run( + base_run_args + + [ + "{{range $index, $value := .Config.Env}}{{$value}} {{end}}", + min_docker_image, + ], + capture_output=True, + text=True, + ) fail_if( pm_path.returncode != 0, - 'docker inspect to find triton enviroment variables for min container failed, {}' - .format(pm_path.stderr)) - # min container needs to be GPU support enabled if the build is GPU build + "docker inspect to find triton environment variables for min container failed, {}".format( + pm_path.stderr + ), + ) + # min container needs to be GPU-support-enabled if the build is GPU build vars = pm_path.stdout e = re.search("CUDA_VERSION", vars) gpu_enabled = False if e is None else True 
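A hedged, standalone sketch of the `docker inspect` environment probing performed in `create_argmap()` above; the helper name and the sample values are illustrative only.

```
import re

def env_value(inspect_output, key):
    # inspect_output is the space-separated "KEY=VALUE " string produced by
    # docker inspect -f '{{range $index, $value := .Config.Env}}{{$value}} {{end}}'
    match = re.search(rf"{key}=([\S]+) ", inspect_output)
    return None if match is None else match.group(1)

# e.g. env_value("TRITON_SERVER_GPU_ENABLED=1 TRITON_SERVER_VERSION=2.35.0 ",
#                "TRITON_SERVER_VERSION") -> "2.35.0"
```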
fail_if( not gpu_enabled, - 'Composing container with gpu support enabled but min container provided does not have CUDA installed' + "Composing container with gpu support enabled but min container provided does not have CUDA installed", ) - # Check full container enviroment variables - p_path = subprocess.run(base_run_args + [ - '{{range $index, $value := .Config.Env}}{{$value}} {{end}}', - full_docker_image - ], - capture_output=True, - text=True) + # Check full container environment variables + p_path = subprocess.run( + base_run_args + + [ + "{{range $index, $value := .Config.Env}}{{$value}} {{end}}", + full_docker_image, + ], + capture_output=True, + text=True, + ) fail_if( p_path.returncode != 0, - 'docker inspect to find enviroment variables for full container failed, {}' - .format(p_path.stderr)) + "docker inspect to find environment variables for full container failed, {}".format( + p_path.stderr + ), + ) vars = p_path.stdout log_verbose("inspect args: {}".format(vars)) e0 = re.search("TRITON_SERVER_GPU_ENABLED=([\S]{1,}) ", vars) e1 = re.search("CUDA_VERSION", vars) gpu_enabled = False - if (e0 != None): + if e0 != None: gpu_enabled = e0.group(1) == "1" - elif (e1 != None): + elif e1 != None: gpu_enabled = True fail_if( gpu_enabled != enable_gpu, - 'Error: full container provided was build with \'TRITON_SERVER_GPU_ENABLED\' as {} and you are composing container with \'TRITON_SERVER_GPU_ENABLED\' as {}' - .format(gpu_enabled, enable_gpu)) + "Error: full container provided was build with " + "'TRITON_SERVER_GPU_ENABLED' as {} and you are composing container" + "with 'TRITON_SERVER_GPU_ENABLED' as {}".format(gpu_enabled, enable_gpu), + ) e = re.search("TRITON_SERVER_VERSION=([\S]{6,}) ", vars) version = "" if e is None else e.group(1) fail_if( len(version) == 0, - 'docker inspect to find triton server version failed, {}'.format( - p_path.stderr)) + "docker inspect to find triton server version failed, {}".format(p_path.stderr), + ) e = re.search("NVIDIA_TRITON_SERVER_VERSION=([\S]{5,}) ", vars) container_version = "" if e is None else e.group(1) fail_if( len(container_version) == 0, - 'docker inspect to find triton container version failed, {}'.format( - vars)) + "docker inspect to find triton container version failed, {}".format(vars), + ) dcgm_ver = re.search("DCGM_VERSION=([\S]{4,}) ", vars) dcgm_version = "" if dcgm_ver is None: dcgm_version = "2.2.3" - log("WARNING: DCGM version not found from image, installing the earlierst version {}" - .format(dcgm_version)) + log( + "WARNING: DCGM version not found from image, installing the earlierst version {}".format( + dcgm_version + ) + ) else: dcgm_version = dcgm_ver.group(1) fail_if( len(dcgm_version) == 0, - 'docker inspect to find DCGM version failed, {}'.format(vars)) + "docker inspect to find DCGM version failed, {}".format(vars), + ) p_sha = subprocess.run( - base_run_args + - ['{{ index .Config.Labels "com.nvidia.build.ref"}}', full_docker_image], + base_run_args + + ['{{ index .Config.Labels "com.nvidia.build.ref"}}', full_docker_image], capture_output=True, - text=True) + text=True, + ) fail_if( p_sha.returncode != 0, - 'docker inspect of upstream docker image build sha failed, {}'.format( - p_sha.stderr)) + "docker inspect of upstream docker image build sha failed, {}".format( + p_sha.stderr + ), + ) p_build = subprocess.run( - base_run_args + - ['{{ index .Config.Labels "com.nvidia.build.id"}}', full_docker_image], + base_run_args + + ['{{ index .Config.Labels "com.nvidia.build.id"}}', full_docker_image], 
capture_output=True, - text=True) + text=True, + ) fail_if( p_build.returncode != 0, - 'docker inspect of upstream docker image build sha failed, {}'.format( - p_build.stderr)) + "docker inspect of upstream docker image build sha failed, {}".format( + p_build.stderr + ), + ) p_find = subprocess.run( - ['docker', 'run', full_docker_image, 'bash', '-c', 'ls /usr/bin/'], + ["docker", "run", full_docker_image, "bash", "-c", "ls /usr/bin/"], capture_output=True, - text=True) + text=True, + ) f = re.search("serve", p_find.stdout) - fail_if(p_find.returncode != 0, - "Cannot search for 'serve' in /usr/bin, {}".format(p_find.stderr)) + fail_if( + p_find.returncode != 0, + "Cannot search for 'serve' in /usr/bin, {}".format(p_find.stderr), + ) argmap = { - 'NVIDIA_BUILD_REF': p_sha.stdout.rstrip(), - 'NVIDIA_BUILD_ID': p_build.stdout.rstrip(), - 'TRITON_VERSION': version, - 'TRITON_CONTAINER_VERSION': container_version, - 'DCGM_VERSION': dcgm_version, - 'SAGEMAKER_ENDPOINT': f is not None, + "NVIDIA_BUILD_REF": p_sha.stdout.rstrip(), + "NVIDIA_BUILD_ID": p_build.stdout.rstrip(), + "TRITON_VERSION": version, + "TRITON_CONTAINER_VERSION": container_version, + "DCGM_VERSION": dcgm_version, + "SAGEMAKER_ENDPOINT": f is not None, } return argmap -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() group_qv = parser.add_mutually_exclusive_group() - group_qv.add_argument('-q', - '--quiet', - action="store_true", - required=False, - help='Disable console output.') - group_qv.add_argument('-v', - '--verbose', - action="store_true", - required=False, - help='Enable verbose output.') + group_qv.add_argument( + "-q", + "--quiet", + action="store_true", + required=False, + help="Disable console output.", + ) + group_qv.add_argument( + "-v", + "--verbose", + action="store_true", + required=False, + help="Enable verbose output.", + ) parser.add_argument( - '--output-name', + "--output-name", type=str, required=False, - help='Name for the generated Docker image. Default is "tritonserver".') + help='Name for the generated Docker image. Default is "tritonserver".', + ) parser.add_argument( - '--work-dir', + "--work-dir", type=str, required=False, - help= - 'Generated dockerfiles are placed here. Default to current directory.') + help="Generated dockerfiles are placed here. Default to current directory.", + ) parser.add_argument( - '--container-version', + "--container-version", type=str, required=False, - help= - 'The version to use for the generated Docker image. If not specified the container version will be chosen automatically based on the repository branch.' + help="The version to use for the generated Docker image. If not specified " + "the container version will be chosen automatically based on the " + "repository branch.", + ) + parser.add_argument( + "--image", + action="append", + required=False, + help="Use specified Docker image to generate Docker image. Specified as " + ',. can be "min", "gpu-min" ' + 'or "full". Both "min" and "full" need to be specified at the same time.' + 'This will override "--container-version". "gpu-min" is needed for ' + "CPU-only container to copy TensorFlow and PyTorch deps.", + ) + parser.add_argument( + "--enable-gpu", + nargs="?", + type=lambda x: (str(x).lower() == "true"), + const=True, + default=True, + required=False, + help=argparse.SUPPRESS, ) parser.add_argument( - '--image', - action='append', + "--backend", + action="append", required=False, - help= - 'Use specified Docker image to generate Docker image. Specified as ,. 
can be "min", "gpu-min" or "full". Both "min" and "full" need to be specified at the same time. This will override "--container-version". "gpu-min" is needed for CPU-only container to copy TensorFlow and PyTorch deps.' + help="Include in the generated Docker image. The flag may be " + "specified multiple times.", ) - parser.add_argument('--enable-gpu', - nargs='?', - type=lambda x: (str(x).lower() == 'true'), - const=True, - default=True, - required=False, - help=argparse.SUPPRESS) parser.add_argument( - '--backend', - action='append', + "--repoagent", + action="append", required=False, - help= - 'Include in the generated Docker image. The flag may be specified multiple times.' + help="Include in the generated Docker image. The flag may " + "be specified multiple times.", ) parser.add_argument( - '--repoagent', - action='append', + "--cache", + action="append", required=False, - help= - 'Include in the generated Docker image. The flag may be specified multiple times.' + help="Include in the generated Docker image. The flag may " + "be specified multiple times.", ) parser.add_argument( - '--dry-run', + "--skip-pull", action="store_true", required=False, - help='Only creates Dockerfile.compose, does not build the Docker image.' + help="Do not pull the required docker images. The user is responsible " + "for pulling the upstream images needed to compose the image.", + ) + parser.add_argument( + "--dry-run", + action="store_true", + required=False, + help="Only creates Dockerfile.compose, does not build the Docker image.", ) FLAGS = parser.parse_args() @@ -367,64 +457,69 @@ def create_argmap(images): if FLAGS.output_name is None: FLAGS.output_name = "tritonserver" - dockerfile_name = 'Dockerfile.compose' + dockerfile_name = "Dockerfile.compose" if FLAGS.backend is None: FLAGS.backend = [] if FLAGS.repoagent is None: FLAGS.repoagent = [] + if FLAGS.cache is None: + FLAGS.cache = [] # Initialize map of docker images. images = {} if FLAGS.image: for img in FLAGS.image: - parts = img.split(',') + parts = img.split(",") fail_if( len(parts) != 2, - '--image must specific ,') + "--image must specific ,", + ) fail_if( - parts[0] not in ['min', 'full', 'gpu-min'], - 'unsupported image-name \'{}\' for --image'.format(parts[0])) + parts[0] not in ["min", "full", "gpu-min"], + "unsupported image-name '{}' for --image".format(parts[0]), + ) log('image "{}": "{}"'.format(parts[0], parts[1])) images[parts[0]] = parts[1] else: get_container_version_if_not_specified() if FLAGS.enable_gpu: images = { - "full": - "nvcr.io/nvidia/tritonserver:{}-py3".format( - FLAGS.container_version), - "min": - "nvcr.io/nvidia/tritonserver:{}-py3-min".format( - FLAGS.container_version) + "full": "nvcr.io/nvidia/tritonserver:{}-py3".format( + FLAGS.container_version + ), + "min": "nvcr.io/nvidia/tritonserver:{}-py3-min".format( + FLAGS.container_version + ), } else: images = { - "full": - "nvcr.io/nvidia/tritonserver:{}-cpu-only-py3".format( - FLAGS.container_version), - "min": - "ubuntu:20.04" + "full": "nvcr.io/nvidia/tritonserver:{}-cpu-only-py3".format( + FLAGS.container_version + ), + "min": "ubuntu:22.04", } - fail_if( - len(images) < 2, - "Need to specify both 'full' and 'min' images if at all") + fail_if(len(images) < 2, "Need to specify both 'full' and 'min' images if at all") # For CPU-only image we need to copy some cuda libraries and dependencies # since we are using PyTorch, TensorFlow 1, TensorFlow 2 containers that # are not CPU-only. 
- if (('pytorch' in FLAGS.backend) or ('tensorflow1' in FLAGS.backend) or - ('tensorflow2' in FLAGS.backend)) and ('gpu-min' not in images): + if ( + ("pytorch" in FLAGS.backend) + or ("tensorflow1" in FLAGS.backend) + or ("tensorflow2" in FLAGS.backend) + ) and ("gpu-min" not in images): images["gpu-min"] = "nvcr.io/nvidia/tritonserver:{}-py3-min".format( - FLAGS.container_version) + FLAGS.container_version + ) - argmap = create_argmap(images) + argmap = create_argmap(images, FLAGS.skip_pull) - start_dockerfile(FLAGS.work_dir, images, argmap, dockerfile_name, - FLAGS.backend) + start_dockerfile(FLAGS.work_dir, images, argmap, dockerfile_name, FLAGS.backend) add_requested_backends(FLAGS.work_dir, dockerfile_name, FLAGS.backend) add_requested_repoagents(FLAGS.work_dir, dockerfile_name, FLAGS.repoagent) + add_requested_caches(FLAGS.work_dir, dockerfile_name, FLAGS.cache) end_dockerfile(FLAGS.work_dir, dockerfile_name, argmap) - if (not FLAGS.dry_run): + if not FLAGS.dry_run: build_docker_image(FLAGS.work_dir, dockerfile_name, FLAGS.output_name) diff --git a/deploy/alibaba-cloud/README.md b/deploy/alibaba-cloud/README.md index 1dea4ede11..98f914a693 100644 --- a/deploy/alibaba-cloud/README.md +++ b/deploy/alibaba-cloud/README.md @@ -1,5 +1,5 @@ -# Deploy Triton Inference Server on PAI-EAS +# Deploy Triton Inference Server on PAI-EAS * Table Of Contents - [Description](https://yuque.alibaba-inc.com/pai/blade/mtptqc#Description) - [Prerequisites](https://yuque.alibaba-inc.com/pai/blade/mtptqc#Prerequisites) @@ -57,11 +57,11 @@ Download the tensorflow inception model via [fetch_model.sh](https://github.com/ The following is the json we use when creating a Triton Server on EAS. ``` { - "name": "", + "name": "", "processor": "triton", "processor_params": [ - "--model-repository=oss://triton-model-repo/models", - "--allow-grpc=true", + "--model-repository=oss://triton-model-repo/models", + "--allow-grpc=true", "--allow-http=true" ], "metadata": { diff --git a/deploy/aws/README.md b/deploy/aws/README.md index 8e99d45c63..4e60fdd65b 100644 --- a/deploy/aws/README.md +++ b/deploy/aws/README.md @@ -1,5 +1,5 @@ + +# Instruction to create BERT engine for each Triton update + +## Description + +``` +docker run --gpus all -it --network host \ + --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \ + -v ~:/scripts nvcr.io/nvidia/tensorrt:23.11-py3 + +pip install onnx six torch tf2onnx tensorflow + +git clone -b main https://github.com/NVIDIA/TensorRT.git +cd TensorRT +git submodule update --init --recursive + +export TRT_OSSPATH=/workspace/TensorRT +export TRT_LIBPATH=/lib/x86_64-linux-gnu + +pushd /usr/local/bin && wget https://ngc.nvidia.com/downloads/ngccli_cat_linux.zip && unzip ngccli_cat_linux.zip && chmod u+x ngc-cli/ngc && rm ngccli_cat_linux.zip ngc-cli.md5 && ln -s ngc-cli/ngc ngc && echo "no-apikey\nascii\n" | ngc config set + +popd + +cd /workspace/TensorRT/demo/BERT +bash ./scripts/download_squad.sh +bash ./scripts/download_model.sh large 128 +# bash ./scripts/download_model.sh large 384 + +mkdir -p engines + +python3 builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/model.ckpt -o engines/bert_large_int8_bs1_s128.engine -b 1 -s 128 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/ -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/vocab.txt --int8 --fp16 --strict --calib-num 1 -iln -imh + +gsutil cp bert_large_int8_bs1_s128.engine gs://triton_sample_models/23_09/bert/1/model.plan +``` + +For each Triton upgrade, container 
version used to generate the model, and the model path in GCS `gs://triton_sample_models/23_09/` should be updated accordingly with the correct version. diff --git a/deploy/k8s-onprem/README.md b/deploy/k8s-onprem/README.md index 48f6a9b911..4287b23c35 100644 --- a/deploy/k8s-onprem/README.md +++ b/deploy/k8s-onprem/README.md @@ -1,5 +1,5 @@ -# Triton Inference Server Documentation - -## User Guide -The User Guide describes how to use Triton as an inference solution, including information on how to configure Triton, how to organize and configure your models, how to use the C++ and Python clients, etc. - -- [QuickStart](quickstart.md) - - [Install Triton](quickstart.md#install-triton-docker-image) - - [Create Model Repository](quickstart.md#create-a-model-repository) - - [Run Triton](quickstart.md#run-triton) -- [Model Repository](model_repository.md) - - [Cloud Storage](model_repository.md#model-repository-locations) - - [File Organization](model_repository.md#model-files) - - [Model Versioning](model_repository.md#model-versions) -- [Model Configuration](model_configuration.md) - - [Required Model Configuration](model_configuration.md#minimal-model-configuration) - - [Maximum Batch Size - Batching and Non-Batching Models](model_configuration.md#maximum-batch-size) - - [Input and Output Tensors](model_configuration.md#inputs-and-outputs) - - [Tensor Datatypes](model_configuration.md#datatypes) - - [Tensor Reshape](model_configuration.md#reshape) - - [Shape Tensor](model_configuration.md#shape-tensors) - - [Auto-Generate Required Model Configuration](model_configuration.md#auto-generated-model-configuration) - - [Version Policy](model_configuration.md#version-policy) - - [Instance Groups](model_configuration.md#instance-groups) - - [Specifying Multiple Model Instances](model_configuration.md#multiple-model-instances) - - [CPU and GPU Instances](model_configuration.md#cpu-model-instance) - - [Configuring Rate Limiter](model_configuration.md#rate-limiter-configuration) - - [Optimization Settings](model_configuration.md#optimization_policy) - - [Framework-Specific Optimization](optimization.md#framework-specific-optimization) - - [ONNX-TensorRT](optimization.md#onnx-with-tensorrt-optimization-ort-trt) - - [ONNX-OpenVINO](optimization.md#onnx-with-openvino-optimization) - - [TensorFlow-TensorRT](optimization.md#tensorflow-with-tensorrt-optimization-tf-trt) - - [TensorFlow-Mixed-Precision](optimization.md#tensorflow-automatic-fp16-optimization) - - [NUMA Optimization](optimization.md#numa-optimization) - - [Scheduling and Batching](model_configuration.md#scheduling-and-batching) - - [Default Scheduler - Non-Batching](model_configuration.md#default-scheduler) - - [Dynamic Batcher](model_configuration.md#dynamic-batcher) - - [How to Configure Dynamic Batcher](model_configuration.md#recommended-configuration-process) - - [Delayed Batching](model_configuration.md#delayed-batching) - - [Preferred Batch Size](model_configuration.md#preferred-batch-sizes) - - [Preserving Request Ordering](model_configuration.md#preserve-ordering) - - [Priority Levels](model_configuration.md#priority-levels) - - [Queuing Policies](model_configuration.md#queue-policy) - - [Ragged Batching](ragged_batching.md) - - [Sequence Batcher](model_configuration.md#sequence-batcher) - - [Stateful Models](architecture.md#stateful-models) - - [Control Inputs](architecture.md#control-inputs) - - [Implicit State - Stateful Inference Using a Stateless Model](architecture.md#implicit-state-management) - - [Sequence Scheduling 
Strategies](architecture.md#scheduling-strateties) - - [Direct](architecture.md#direct) - - [Oldest](architecture.md#oldest) - - [Rate Limiter](rate_limiter.md) - - [Model Warmup](model_configuration.md#model-warmup) - - [Inference Request/Response Cache](model_configuration.md#response-cache) -- Model Pipeline - - [Model Ensemble](architecture.md#ensemble-models) - - [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting) -- [Model Management](model_management.md) - - [Explicit Model Loading and Unloading](model_management.md#model-control-mode-explicit) - - [Modifying the Model Repository](model_management.md#modifying-the-model-repository) -- [Metrics](metrics.md) -- [Framework Custom Operations](custom_operations.md) - - [TensorRT](custom_operations.md#tensorrt) - - [TensorFlow](custom_operations.md#tensorflow) - - [PyTorch](custom_operations.md#pytorch) - - [ONNX](custom_operations.md#onnx) -- [Client Libraries and Examples](https://github.com/triton-inference-server/client) - - [C++ HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis) - - [Python HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis) - - [Java HTTP Library](https://github.com/triton-inference-server/client/tree/main/src/java) - - GRPC Generated Libraries - - [go](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/go) - - [Java/Scala](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/java) - - [Javascript](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/javascript) -- [Performance Analysis](optimization.md) - - [Model Analyzer](model_analyzer.md) - - [Performance Analyzer](perf_analyzer.md) - - [Inference Request Tracing](trace.md) -- [Jetson and JetPack](jetson.md) - -## Developer Guide -The Developer Guide describes how to build and test Triton and also how Triton can be extended with new functionality. - -- [Build](build.md) -- [Protocols and APIs](inference_protocols.md). +# **Triton Inference Server Documentation** + +| [Installation](README.md#installation) | [Getting Started](README.md#getting-started) | [User Guide](README.md#user-guide) | [API Guide](protocol/README.md) | [Additional Resources](README.md#resources) | [Customization Guide](README.md#customization-guide) | +| ------------ | --------------- | --------------- | ------------ | --------------- | --------------- | + +**New to Triton Inference Server?** Make use of +[these tutorials](https://github.com/triton-inference-server/tutorials) + to begin your Triton journey! + +## **Installation** +Before you can use the Triton Docker image you must install +[Docker](https://docs.docker.com/engine/install). If you plan on using +a GPU for inference you must also install the [NVIDIA Container +Toolkit](https://github.com/NVIDIA/nvidia-docker). DGX users should +follow [Preparing to use NVIDIA +Containers](http://docs.nvidia.com/deeplearning/dgx/preparing-containers/index.html). + +Pull the image using the following command. + +``` +$ docker pull nvcr.io/nvidia/tritonserver:-py3 +``` + +Where \ is the version of Triton that you want to pull. For a complete list of all the variants and versions of the Triton Inference Server Container, visit the [NGC Page](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver). 
More information about customizing the Triton Container can be found in [this section](customization_guide/compose.md) of the User Guide. + +## **Getting Started** + +This guide covers the simplest possible workflow for deploying a model using Triton Inference Server. +- [Create a Model Repository](getting_started/quickstart.md#create-a-model-repository) +- [Launch Triton](getting_started/quickstart.md#launch-triton) +- [Send an Inference Request](getting_started/quickstart.md#send-an-inference-request) + +Triton Inference Server has a considerable list of versatile and powerful features. All new users are encouraged to explore the [User Guide](README.md#user-guide) and the [additional resources](README.md#resources) sections for features most relevant to their use case. + +## **User Guide** +The User Guide describes how to configure Triton, organize and configure your models, use the C++ and Python clients, etc. This guide includes the following: +* Creating a Model Repository [[Overview](README.md#model-repository) || [Details](user_guide/model_repository.md)] +* Writing a Model Configuration [[Overview](README.md#model-configuration) || [Details](user_guide/model_configuration.md)] +* Building a Model Pipeline [[Overview](README.md#model-pipeline)] +* Managing Model Availability [[Overview](README.md#model-management) || [Details](user_guide/model_management.md)] +* Collecting Server Metrics [[Overview](README.md#metrics) || [Details](user_guide/metrics.md)] +* Supporting Custom Ops/layers [[Overview](README.md#framework-custom-operations) || [Details](user_guide/custom_operations.md)] +* Using the Client API [[Overview](README.md#client-libraries-and-examples) || [Details](https://github.com/triton-inference-server/client)] +* Cancelling Inference Requests [[Overview](README.md#cancelling-inference-requests) || [Details](user_guide/request_cancellation.md)] +* Analyzing Performance [[Overview](README.md#performance-analysis)] +* Deploying on edge (Jetson) [[Overview](README.md#jetson-and-jetpack)] +* Debugging Guide [Details](./user_guide/debugging_guide.md) + +### Model Repository +[Model Repositories](user_guide/model_repository.md) are the organizational hub for using Triton. All models, configuration files, and additional resources needed to serve the models are housed inside a model repository. +- [Cloud Storage](user_guide/model_repository.md#model-repository-locations) +- [File Organization](user_guide/model_repository.md#model-files) +- [Model Versioning](user_guide/model_repository.md#model-versions) +### Model Configuration + +A [Model Configuration](user_guide/model_configuration.md) file is where you set the model-level options, such as output tensor reshaping and dynamic batch sizing. + +#### Required Model Configuration + +Triton Inference Server requires some [Minimum Required parameters](user_guide/model_configuration.md#minimal-model-configuration) to be filled in the Model Configuration. These required parameters essentially pertain to the structure of the model. For TensorFlow, ONNX and TensorRT models, users can rely on Triton to [Auto Generate](user_guide/model_configuration.md#auto-generated-model-configuration) the Minimum Required model configuration.
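+A quick way to see what Triton filled in is to ask a running server for the
+configuration it generated through the model configuration endpoint. The
+following is a minimal sketch, assuming the server is listening on the default
+HTTP port 8000 and serving a hypothetical model named `densenet_onnx`:
+
+```
+$ curl localhost:8000/v2/models/densenet_onnx/config
+```
+
+The returned JSON lists the inputs, outputs and other settings Triton derived
+from the model, which can be used as a starting point for an explicit
+`config.pbtxt` if one is needed.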
+- [Maximum Batch Size - Batching and Non-Batching Models](user_guide/model_configuration.md#maximum-batch-size) +- [Input and Output Tensors](user_guide/model_configuration.md#inputs-and-outputs) + - [Tensor Datatypes](user_guide/model_configuration.md#datatypes) + - [Tensor Reshape](user_guide/model_configuration.md#reshape) + - [Shape Tensor](user_guide/model_configuration.md#shape-tensors) + +#### Versioning Models +Users need the ability to save and serve different versions of models based on business requirements. Triton allows users to set policies to make available different versions of the model as needed. [Learn More](user_guide/model_configuration.md#version-policy). + +#### Instance Groups +Triton allows users to use multiple instances of the same model. Users can specify how many instances (copies) of a model to load and whether to use GPU or CPU. If the model is being loaded on GPU, users can also select which GPUs to use. [Learn more](user_guide/model_configuration.md#instance-groups). +- [Specifying Multiple Model Instances](user_guide/model_configuration.md#multiple-model-instances) +- [CPU and GPU Instances](user_guide/model_configuration.md#cpu-model-instance) +- [Configuring Rate Limiter](user_guide/model_configuration.md#rate-limiter-configuration) + +#### Optimization Settings + +The Model Configuration ModelOptimizationPolicy property is used to specify optimization and prioritization settings for a model. These settings control if/how a model is optimized by the backend and how it is scheduled and executed by Triton. See the [ModelConfig Protobuf](https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto) and [Optimization Documentation](user_guide/optimization.md#optimization) for the currently available settings. +- [Framework-Specific Optimization](user_guide/optimization.md#framework-specific-optimization) + - [ONNX-TensorRT](user_guide/optimization.md#onnx-with-tensorrt-optimization-ort-trt) + - [ONNX-OpenVINO](user_guide/optimization.md#onnx-with-openvino-optimization) + - [TensorFlow-TensorRT](user_guide/optimization.md#tensorflow-with-tensorrt-optimization-tf-trt) + - [TensorFlow-Mixed-Precision](user_guide/optimization.md#tensorflow-automatic-fp16-optimization) +- [NUMA Optimization](user_guide/optimization.md#numa-optimization) + +#### Scheduling and Batching + +Triton supports batching individual inference requests to improve compute resource utilization. This is extremely important because individual requests typically will not saturate GPU resources, leaving the parallelism provided by GPUs underutilized. Learn more about Triton's [Batcher and Scheduler](user_guide/model_configuration.md#scheduling-and-batching).
+- [Default Scheduler - Non-Batching](user_guide/model_configuration.md#default-scheduler) +- [Dynamic Batcher](user_guide/model_configuration.md#dynamic-batcher) + - [How to Configure Dynamic Batcher](user_guide/model_configuration.md#recommended-configuration-process) + - [Delayed Batching](user_guide/model_configuration.md#delayed-batching) + - [Preferred Batch Size](user_guide/model_configuration.md#preferred-batch-sizes) + - [Preserving Request Ordering](user_guide/model_configuration.md#preserve-ordering) + - [Priority Levels](user_guide/model_configuration.md#priority-levels) + - [Queuing Policies](user_guide/model_configuration.md#queue-policy) + - [Ragged Batching](user_guide/ragged_batching.md) +- [Sequence Batcher](user_guide/model_configuration.md#sequence-batcher) + - [Stateful Models](user_guide/architecture.md#stateful-models) + - [Control Inputs](user_guide/architecture.md#control-inputs) + - [Implicit State - Stateful Inference Using a Stateless Model](user_guide/architecture.md#implicit-state-management) + - [Sequence Scheduling Strategies](user_guide/architecture.md#scheduling-strategies) + - [Direct](user_guide/architecture.md#direct) + - [Oldest](user_guide/architecture.md#oldest) + +#### Rate Limiter +The rate limiter manages the rate at which requests are scheduled on model instances by Triton. The rate limiter operates across all models loaded in Triton to allow cross-model prioritization. [Learn more](user_guide/rate_limiter.md). + +#### Model Warmup +For a few of the Backends (check [Additional Resources](README.md#resources)), some or all of the initialization is deferred until the first inference request is received. The benefit is resource conservation, but it comes with the downside that the initial requests are processed more slowly than expected. Users can pre-"warm up" the model by instructing Triton to initialize the model. [Learn more](user_guide/model_configuration.md#model-warmup). + +#### Inference Request/Response Cache +Triton has a feature which allows inference responses to be cached. [Learn More](user_guide/response_cache.md). + +### Model Pipeline +Building ensembles is as easy as adding an additional configuration file which outlines the specific flow of tensors from one model to another. Any additional changes required by the model ensemble can be made in existing (individual) model configurations. +- [Model Ensemble](user_guide/architecture.md#ensemble-models) +- [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting) +### Model Management +Users can specify policies in the model configuration for loading and unloading of models. This [section](user_guide/model_management.md) covers user-selectable policy details. +- [Explicit Model Loading and Unloading](user_guide/model_management.md#model-control-mode-explicit) +- [Modifying the Model Repository](user_guide/model_management.md#modifying-the-model-repository) +### Metrics +Triton provides Prometheus metrics like GPU Utilization, Memory Usage, Latency and more. Learn about [available metrics](user_guide/metrics.md). +### Framework Custom Operations +Some frameworks provide the option of building custom layers/operations. These can be added to specific Triton Backends for those frameworks.
[Learn more](user_guide/custom_operations.md) +- [TensorRT](user_guide/custom_operations.md#tensorrt) +- [TensorFlow](user_guide/custom_operations.md#tensorflow) +- [PyTorch](user_guide/custom_operations.md#pytorch) +- [ONNX](user_guide/custom_operations.md#onnx) +### Client Libraries and Examples +Use the [Triton Client](https://github.com/triton-inference-server/client) API to integrate client applications over the network HTTP/gRPC API or integrate applications directly with Triton using CUDA shared memory to remove network overhead. +- [C++ HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis) +- [Python HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis) +- [Java HTTP Library](https://github.com/triton-inference-server/client/tree/main/src/java) +- GRPC Generated Libraries + - [go](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/go) + - [Java/Scala](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/java) + - [Javascript](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/javascript) +- [Shared Memory Extension](protocol/extension_shared_memory.md) +### Cancelling Inference Requests +Triton can detect and handle requests that have been cancelled from the client side. This [document](user_guide/request_cancellation.md) discusses the scope and limitations of the feature. +### Performance Analysis +Understanding inference performance is key to better resource utilization. Use Triton's tools to customize your deployment. +- [Performance Tuning Guide](user_guide/performance_tuning.md) +- [Optimization](user_guide/optimization.md) +- [Model Analyzer](user_guide/model_analyzer.md) +- [Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md) +- [Inference Request Tracing](user_guide/trace.md) +### Jetson and JetPack +Triton can be deployed on edge devices. Explore [resources](user_guide/jetson.md) and [examples](examples/jetson/README.md). + +## **Resources** + +The following resources are recommended to explore the full suite of Triton Inference Server's functionalities. +- **Clients**: Triton Inference Server comes with C++, Python and Java APIs with which users can send HTTP/REST or gRPC (with possible extensions for other languages) requests. Explore the [client repository](https://github.com/triton-inference-server/server/tree/main/docs/protocol) for examples and documentation. + +- **Configuring Deployment**: Triton comes with three tools which can be used to configure deployment settings, measure performance and recommend optimizations. + - [Model Analyzer](https://github.com/triton-inference-server/model_analyzer): Model Analyzer is a CLI tool built to recommend deployment configurations for Triton Inference Server based on the user's Quality of Service requirements. It also generates detailed reports about model performance to summarize the benefits and trade-offs of different configurations. + - [Perf Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md): + Perf Analyzer is a CLI application built to generate inference requests and + measure the latency of those requests and the throughput of the model being + served.
+ - [Model Navigator](https://github.com/triton-inference-server/model_navigator): + The Triton Model Navigator is a tool that provides the ability to automate the process of moving a model from source to the optimal format and configuration for deployment on Triton Inference Server. The tool supports exporting a model from source to all possible formats and applies the Triton Inference Server backend optimizations. + +- **Backends**: Triton supports a wide variety of frameworks used to run models. Users can extend this functionality by creating custom backends. + - [PyTorch](https://github.com/triton-inference-server/pytorch_backend): Widely used Open Source DL Framework + - [TensorFlow](https://github.com/triton-inference-server/tensorflow_backend): Widely used Open Source DL Framework + - [TensorRT](https://github.com/triton-inference-server/tensorrt_backend): NVIDIA [TensorRT](https://developer.nvidia.com/tensorrt) is an inference acceleration SDK that provides a wide range of graph optimizations, kernel optimizations, use of lower precision, and more. + - [ONNX](https://github.com/triton-inference-server/onnxruntime_backend): ONNX Runtime is a cross-platform inference and training machine-learning accelerator. + - [OpenVINO](https://github.com/triton-inference-server/openvino_backend): OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. + - [Paddle Paddle](https://github.com/triton-inference-server/paddlepaddle_backend): Widely used Open Source DL Framework + - [Python](https://github.com/triton-inference-server/python_backend): Users can add custom business logic, or any Python code/model for serving requests. + - [Forest Inference Library](https://github.com/triton-inference-server/fil_backend): Backend built for forest models trained by several popular machine learning frameworks (including XGBoost, LightGBM, Scikit-Learn, and cuML) + - [DALI](https://github.com/triton-inference-server/dali_backend): NVIDIA [DALI](https://developer.nvidia.com/dali) is a Data Loading Library purpose-built to accelerate the pre-processing and data loading steps in a Deep Learning Pipeline. + - [HugeCTR](https://github.com/triton-inference-server/hugectr_backend): HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates + - [Managed Stateful Models](https://github.com/triton-inference-server/stateful_backend): This backend automatically manages the input and output states of a model. The states are associated with a sequence id and need to be tracked for inference requests associated with the sequence id. + - [Faster Transformer](https://github.com/triton-inference-server/fastertransformer_backend): NVIDIA [FasterTransformer](https://github.com/NVIDIA/FasterTransformer/) (FT) is a library implementing an accelerated engine for the inference of transformer-based neural networks, with a special emphasis on large models, spanning many GPUs and nodes in a distributed manner. + - [Building Custom Backends](https://github.com/triton-inference-server/backend/tree/main/examples#tutorial) + - [Sample Custom Backend: Repeat_backend](https://github.com/triton-inference-server/repeat_backend): Backend built to demonstrate sending of zero, one, or multiple responses per request. + +## **Customization Guide** +This guide describes how to build and test Triton and also how Triton can be extended with new functionality.
+ +- [Build](customization_guide/build.md) +- [Protocols and APIs](customization_guide/inference_protocols.md). - [Backends](https://github.com/triton-inference-server/backend) -- [Repository Agents](repository_agents.md) -- [Test](test.md) +- [Repository Agents](customization_guide/repository_agents.md) +- [Test](customization_guide/test.md) diff --git a/docs/_static/.gitattributes b/docs/_static/.gitattributes new file mode 100644 index 0000000000..04865f126a --- /dev/null +++ b/docs/_static/.gitattributes @@ -0,0 +1,2 @@ +nvidia-logo-horiz-rgb-blk-for-screen.png filter=lfs diff=lfs merge=lfs -text +nvidia-logo-vert-rgb-blk-for-screen.png filter=lfs diff=lfs merge=lfs -text diff --git a/docs/_static/custom.css b/docs/_static/custom.css new file mode 100644 index 0000000000..46bab57d4e --- /dev/null +++ b/docs/_static/custom.css @@ -0,0 +1,319 @@ +/* +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+*/ +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/5/2/52891dda673228d54e5d57bf1e4a3880d4b22405.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/e/0/e090b7dda7a582522c7f9045c6ce949cce60134f.woff) format("woff"); + font-weight: 300; + font-style: normal; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/a/1/a107baabcbf6b241099122336bce7429bcfd377a.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/3/a/3a6060a4e3bce70e5552ba0de8af4b22c6cf9144.woff) format("woff"); + font-weight: 300; + font-style: italic; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/9/9/9920d2b172b01d92fc9c1c0e521dcf45b59c47c3.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/6/c/6c7d947928a7e4ef3e80ed409bef6c243f2148cb.woff) format("woff"); + font-weight: 400; + font-style: normal; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/e/8/e8e63fe1244372cd942d957f44a5616a1eba0644.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/0/f/0f1fb2af0283ab09d36e7097bb07d895c3228f12.woff) format("woff"); + font-weight: 400; + font-style: italic; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/7/9/79d3c513a9cd72c59f65354f39f89ca52dc17dd2.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/2/5/2581ac533f5d01f4985d8a7245b0766b4630ced8.woff) format("woff"); + font-weight: 500; + font-style: normal; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/3/9/39d9ef1ee9770dd503f19bb2ace2fdb4eff3bb50.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/7/b/7bb5d5e2e71b2e13c8098b2e67c0a0ed9258e6c7.woff) format("woff"); + font-weight: 500; + font-style: italic; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/0/5/05276a55a43eb3f74981ec1e93252727afcd9d16.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/9/c/9cfec7ed941b06564aa4d5ca14610e81542d070f.woff) format("woff"); + font-weight: 700; + font-style: normal; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/a/e/aebd14d09ba56f541e1b8735fb051e33710f9ae7.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/e/d/edbdabef43acc5c12e84a94baaa5542c9404cfeb.woff) format("woff"); + font-weight: 700; + font-style: italic; +} + +/* Custom Styles */ +:root { +--pst-font-size-base: none; +--pst-color-primary: 0, 133, 197; +--pst-color-admonition-note: var(--pst-color-primary); +--pst-color-admonition-default: var(--pst-color-primary); +--pst-color-info: 255, 193, 7; +--pst-color-admonition-tip: var(--pst-color-info); +--pst-color-admonition-hint: var(--pst-color-info); +--pst-color-admonition-important: var(--pst-color-info); +--pst-color-warning: 245, 162, 82; +--pst-color-danger: 230, 101, 129; +--pst-color-admonition-warning: var(--pst-color-danger); +--pst-color-link: 118, 185, 0; +--pst-color-inline-code: 92, 22, 130; +--font-family-sans-serif: NVIDIA Sans, Helvetica, Arial, Sans-serif; +--pst-font-family-base-system: NVIDIA Sans, Helvetica, Arial, Sans-serif; +font-family: NVIDIA Sans, Helvetica, Arial, Sans-serif; +} + 
+.prev-next-area { + font-size: small; +} + +.docutils caption { + caption-side: top; +} + +#site-navigation h1.site-logo { + font-size: 0.85em; +} + +/* colors +nv green 118,185,0 +black 0, 0, 0 +light gray 205, 205, 205 +medium gray 140, 140, 140 +dark gray 94, 94, 94 + +emerald 0, 133, 100 +emerald #008564 +amethyst 92, 22, 130 +amethyst #5C1682 +cpu blue 0, 133, 197 +cpu blue #0085C5 +garnet 137, 12, 88 +garnet 890C58 +fluorite 250, 194, 0 +fluorite FAC200 +*/ + +:root { + --nv-green: #76b900; + --nv-green-darken: #6ead00; + --emerald: #008564; + --emerald-darken: #017c5d; + --amethyst: #5d1682; + --amethyst-darken: #4c116b; + --cpu-blue: #0071c5; + --cpu-blue-darken: #0062ad; + --garnet: #890c58; + --garnet-darken: #7a0c4e; + --fluorite: #fac200; + --fluorite-darken: #e4b301; + --dark-gray: #5e5e5e; + --light-gray: #cdcdcd; + --medium-gray: #8c8c8c; + --medium-gray-darken: #8c8c8cde; + --primary: #76b900; + --secondary: #008564; + --success: #5d1682; + --info: #0071c5; + --warning: #fac200; + --danger: #890c58; +} + +/* Riva TBYB (ASR and TTS) Styling */ +.demo-box { + background-color: rgb(245,245,245); +} +a:link { text-decoration: none; } +.scrollable { + height: 125px; + overflow-y: auto; + font-size: 1.3rem; +} +.dot { + height: 8px; + width: 8px; + background-color: rgb(228, 77, 77); + border-radius: 50%; + display: inline-block; +} +.timer { + font-size: 80%; + text-transform: uppercase; + white-space: nowrap; +} +.form-select { + border-radius: 0%; + font-size: 80%; +} +.form-control { + border-radius: 0%; +} +.input-group-text { + border-radius: 0%; + font-size: 80%; + text-transform: uppercase; + background-color: rgb(245,245,245); +} +.card { + border-radius: 0%; +} +.speech-control { + border-top-width: 0px; +} +.btn { + border-radius: 0%; + font-size: 80%; + text-transform: uppercase; + white-space: nowrap; + min-width: 125px; +} +.btn-primary { + background-color: var(--nv-green); + border-color: var(--nv-green); +} +.btn-primary:hover { + background-color: var(--nv-green-darken); + border-color: var(--nv-green-darken); +} +.btn-primary:focus, .btn-primary.focus { + background-color: var(--nv-green-darken); + border-color: var(--nv-green-darken); + -webkit-box-shadow: 0 0 0 0.2rem rgba(147, 173, 102, 0.5); + box-shadow: 0 0 0 0.2rem rgba(147, 173, 102, 0.5); +} +.btn-primary.disabled, .btn-primary:disabled { + background-color: var(--nv-green); + border-color: var(--nv-green); +} +.btn-primary:not(:disabled):not(.disabled):active, .btn-primary:not(:disabled):not(.disabled).active, +.show > .btn-primary.dropdown-toggle { + background-color: var(--nv-green-darken); + border-color: var(--nv-green-darken); +} +.btn-primary:not(:disabled):not(.disabled):active:focus, .btn-primary:not(:disabled):not(.disabled).active:focus, +.show > .btn-primary.dropdown-toggle:focus { + -webkit-box-shadow: 0 0 0 0.2rem rgba(147, 173, 102, 0.5); + box-shadow: 0 0 0 0.2rem rgba(147, 173, 102, 0.5); +} +.btn-secondary { + background-color: var(--medium-gray); + border-color: var(--medium-gray); +} +.btn-secondary:hover { + background-color: var(--medium-gray-darken); + border-color: var(--medium-gray-darken); +} +.btn-secondary:focus, .btn-secondary.focus { + background-color: var(--medium-gray-darken); + border-color: var(--medium-gray-darken); + -webkit-box-shadow: 0 0 0 0.2rem rgba(140, 140, 140, 0.5); + box-shadow: 0 0 0 0.2rem rgba(140, 140, 140, 0.5); +} +.btn-secondary.disabled, .btn-secondary:disabled { + background-color: var(--medium-gray); + border-color: var(--medium-gray); +} 
+.btn-secondary:not(:disabled):not(.disabled):active, .btn-secondary:not(:disabled):not(.disabled).active, +.show > .btn-secondary.dropdown-toggle { + background-color: var(--medium-gray-darken); + border-color: var(--medium-gray-darken); +} +.btn-secondary:not(:disabled):not(.disabled):active:focus, .btn-secondary:not(:disabled):not(.disabled).active:focus, +.show > .btn-secondary.dropdown-toggle:focus { + -webkit-box-shadow: 0 0 0 0.2rem rgba(140, 140, 140, 0.5); + box-shadow: 0 0 0 0.2rem rgba(140, 140, 140, 0.5); +} +.btn-link { + color: var(--nv-green); + text-decoration-line: none; +} +.btn-link:hover { + color: var(--nv-green-darken); +} +.btn-link:focus, .btn-link.focus { + color: var(--nv-green-darken); + -webkit-box-shadow: 0 0 0 0.2rem rgba(147, 173, 102, 0.5); + box-shadow: 0 0 0 0.2rem rgba(147, 173, 102, 0.5); +} +.link-primary { + color: var(--nv-green); +} +.link-primary:hover { + color: var(--nv-green-darken); +} + +/* Riva ASR Styles */ +#riva-upload-label { + margin-top: 0.5rem; +} + +/* Riva TTS Styles */ +.tts-control { + justify-content: space-between; + align-items: center; +} + +.tts-control > p { + margin: unset; +} + +#riva-tts-field { + resize: none; + border: unset; + padding: 0; + height: 100%; + font-size: 1.0rem; +} + +#riva-terms-of-use p { + max-width: 620px; +} + +/* Media Queries */ +@media (max-width: 1024px) { + + /* Riva TTS and ASR */ + .scrollable { + height: 250px; + } +} + diff --git a/docs/_static/logo_2color_horizontal.svg b/docs/_static/logo_2color_horizontal.svg new file mode 100644 index 0000000000..5ab0442d32 --- /dev/null +++ b/docs/_static/logo_2color_horizontal.svg @@ -0,0 +1,2 @@ + + diff --git a/docs/_static/logo_2color_vertical.svg b/docs/_static/logo_2color_vertical.svg new file mode 100644 index 0000000000..69e64b7001 --- /dev/null +++ b/docs/_static/logo_2color_vertical.svg @@ -0,0 +1,2 @@ + + diff --git a/docs/_static/nvidia-logo-horiz-rgb-blk-for-screen.png b/docs/_static/nvidia-logo-horiz-rgb-blk-for-screen.png new file mode 100644 index 0000000000..6316a9340f --- /dev/null +++ b/docs/_static/nvidia-logo-horiz-rgb-blk-for-screen.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:dd57ffce985e08c97c6af5fdadd2a28e4a92996455edc2d0598dd964cca51eae +size 48928 diff --git a/docs/_static/nvidia-logo-vert-rgb-blk-for-screen.png b/docs/_static/nvidia-logo-vert-rgb-blk-for-screen.png new file mode 100644 index 0000000000..5546c1b57d --- /dev/null +++ b/docs/_static/nvidia-logo-vert-rgb-blk-for-screen.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:17a25111e145aa52b77ec5a89eb3b0c7d9a2a90dea25a0bb867a937514fc783c +size 63541 diff --git a/docs/_static/rtd-data.js b/docs/_static/rtd-data.js new file mode 100644 index 0000000000..7ed13e8ee0 --- /dev/null +++ b/docs/_static/rtd-data.js @@ -0,0 +1,36 @@ +/* +# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +*/ + +// Dummy data for testing ReadTheDocs footer insertion +// This mimics RTD data for a project that uses both versions + languages +var READTHEDOCS_DATA = { + project: "frc-docs", + version: "latest", + language: "en", + proxied_api_host: "https://readthedocs.org", +}; diff --git a/docs/_templates/layout.html b/docs/_templates/layout.html new file mode 100644 index 0000000000..570aba8ba3 --- /dev/null +++ b/docs/_templates/layout.html @@ -0,0 +1,31 @@ + +{% extends "!layout.html" %} +{%- block footer %} + +{%- endblock %} diff --git a/docs/conf.py b/docs/conf.py new file mode 100755 index 0000000000..9378329752 --- /dev/null +++ b/docs/conf.py @@ -0,0 +1,256 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# Configuration file for the Sphinx documentation builder. +# +# This file only contains a selection of the most common options. 
For a full +# list see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Path setup -------------------------------------------------------------- + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. +# +import os + +from docutils import nodes +from sphinx import search + +# import sys +# sys.path.insert(0, os.path.abspath('.')) + +# -- Project information ----------------------------------------------------- + +project = "NVIDIA Triton Inference Server" +copyright = "2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved" +author = "NVIDIA" + +# The full version, including alpha/beta/rc tags +# Env only set during riva-release process, otherwise keep as dev for all internal builds +release = os.getenv("TRITON_VERSION", "dev") + +# maintain left-side bar toctrees in `contents` file +# so it doesn't show up needlessly in the index page +master_doc = "contents" + +# -- General configuration --------------------------------------------------- + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = [ + "ablog", + "myst_nb", + "sphinx_copybutton", + "sphinx_design", + "sphinx-prompt", + # "sphinxcontrib.bibtex", + "sphinx_tabs.tabs", + "sphinx_sitemap", +] + +suppress_warnings = ["myst.domains", "ref.ref"] + +numfig = True + +# final location of docs for seo/sitemap +html_baseurl = ( + "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/" +) + +myst_enable_extensions = [ + "dollarmath", + "amsmath", + "deflist", + # "html_admonition", + # "html_image", + "colon_fence", + # "smartquotes", + "replacements", + # "linkify", + "substitution", +] +myst_heading_anchors = 5 + +# Add any paths that contain templates here, relative to this directory. +templates_path = ["_templates"] + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This pattern also affects html_static_path and html_extra_path. +exclude_patterns = ["README.md"] + +# -- Options for HTML output ------------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +html_theme = "sphinx_book_theme" +html_logo = "_static/nvidia-logo-horiz-rgb-blk-for-screen.png" +html_title = "NVIDIA Triton Inference Server" +html_short_title = "Triton" +html_copy_source = True +html_sourcelink_suffix = "" +html_favicon = "_static/nvidia-logo-vert-rgb-blk-for-screen.png" +html_last_updated_fmt = "" +html_additional_files = ["index.html"] + +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". 
+html_static_path = ["_static"] +html_css_files = ["custom.css"] + +html_theme_options = { + "path_to_docs": "docs", + # "launch_buttons": { + # "binderhub_url": "https://mybinder.org", + # "colab_url": "https://colab.research.google.com/", + # "deepnote_url": "https://deepnote.com/", + # "notebook_interface": "jupyterlab", + # "thebe": True, + # # "jupyterhub_url": "https://datahub.berkeley.edu", # For testing + # }, + "use_edit_page_button": False, + "use_issues_button": True, + "use_repository_button": True, + "use_download_button": False, + "logo_only": False, + "show_toc_level": 2, + "extra_navbar": "", + "extra_footer": "", + "repository_url": "https://github.com/triton-inference-server/server", + "use_repository_button": True, +} + +version_short = release +deploy_ngc_org = "nvidia" +deploy_ngc_team = "triton" +myst_substitutions = { + "VersionNum": version_short, + "deploy_ngc_org_team": f"{deploy_ngc_org}/{deploy_ngc_team}" + if deploy_ngc_team + else deploy_ngc_org, +} + + +def ultimateReplace(app, docname, source): + result = source[0] + for key in app.config.ultimate_replacements: + result = result.replace(key, app.config.ultimate_replacements[key]) + source[0] = result + + +# this is a necessary hack to allow us to fill in variables that exist in code blocks +ultimate_replacements = { + "{VersionNum}": version_short, + "{SamplesVersionNum}": version_short, + "{NgcOrgTeam}": f"{deploy_ngc_org}/{deploy_ngc_team}" + if deploy_ngc_team + else deploy_ngc_org, +} + +# bibtex_bibfiles = ["references.bib"] +# To test that style looks good with common bibtex config +# bibtex_reference_style = "author_year" +# bibtex_default_style = "plain" + +### We currently use Myst: https://myst-nb.readthedocs.io/en/latest/use/execute.html +jupyter_execute_notebooks = "off" # Global execution disable +# execution_excludepatterns = ['tutorials/tts-python-basics.ipynb'] # Individual notebook disable + + +def setup(app): + app.add_config_value("ultimate_replacements", {}, True) + app.connect("source-read", ultimateReplace) + app.add_js_file("https://js.hcaptcha.com/1/api.js") + + visitor_script = ( + "//assets.adobedtm.com/5d4962a43b79/c1061d2c5e7b/launch-191c2462b890.min.js" + ) + + if visitor_script: + app.add_js_file(visitor_script) + + # if not os.environ.get("READTHEDOCS") and not os.environ.get("GITHUB_ACTIONS"): + # app.add_css_file( + # "https://assets.readthedocs.org/static/css/readthedocs-doc-embed.css" + # ) + # app.add_css_file("https://assets.readthedocs.org/static/css/badge_only.css") + + # # Create the dummy data file so we can link it + # # ref: https://github.com/readthedocs/readthedocs.org/blob/bc3e147770e5740314a8e8c33fec5d111c850498/readthedocs/core/static-src/core/js/doc-embed/footer.js # noqa: E501 + # app.add_js_file("rtd-data.js") + # app.add_js_file( + # "https://assets.readthedocs.org/static/javascript/readthedocs-doc-embed.js", + # priority=501, + # ) + + +# Patch for sphinx.search stemming short terms (i.e. 
tts -> tt) +# https://github.com/sphinx-doc/sphinx/blob/4.5.x/sphinx/search/__init__.py#L380 +def sphinxSearchIndexFeed( + self, docname: str, filename: str, title: str, doctree: nodes.document +): + """Feed a doctree to the index.""" + self._titles[docname] = title + self._filenames[docname] = filename + + visitor = search.WordCollector(doctree, self.lang) + doctree.walk(visitor) + + # memoize self.lang.stem + def stem(word: str) -> str: + try: + return self._stem_cache[word] + except KeyError: + self._stem_cache[word] = self.lang.stem(word).lower() + return self._stem_cache[word] + + _filter = self.lang.word_filter + + for word in visitor.found_title_words: + stemmed_word = stem(word) + if len(stemmed_word) > 3 and _filter(stemmed_word): + self._title_mapping.setdefault(stemmed_word, set()).add(docname) + elif _filter(word): # stemmer must not remove words from search index + self._title_mapping.setdefault(word.lower(), set()).add(docname) + + for word in visitor.found_words: + stemmed_word = stem(word) + # again, stemmer must not remove words from search index + if len(stemmed_word) <= 3 or not _filter(stemmed_word) and _filter(word): + stemmed_word = word.lower() + already_indexed = docname in self._title_mapping.get(stemmed_word, set()) + if _filter(stemmed_word) and not already_indexed: + self._mapping.setdefault(stemmed_word, set()).add(docname) + + +search.IndexBuilder.feed = sphinxSearchIndexFeed diff --git a/docs/contents.md b/docs/contents.md new file mode 100644 index 0000000000..ca952fed2c --- /dev/null +++ b/docs/contents.md @@ -0,0 +1,104 @@ + + +```{toctree} +:maxdepth: 1 +:caption: Getting Started + +getting_started/quickstart +``` + +```{toctree} +:maxdepth: 1 +:caption: User Guide + +user_guide/performance_tuning +user_guide/architecture +user_guide/model_repository +customization_guide/repository_agents +user_guide/model_configuration +user_guide/request_cancellation +user_guide/optimization +user_guide/ragged_batching +user_guide/rate_limiter +user_guide/model_analyzer +user_guide/perf_analyzer +user_guide/model_management +user_guide/custom_operations +user_guide/decoupled_models +user_guide/response_cache +user_guide/metrics +user_guide/trace +user_guide/jetson +user_guide/v1_to_v2 +customization_guide/deploy +``` + +```{toctree} +:maxdepth: 1 +:caption: Debugging + +user_guide/debugging_guide +user_guide/faq +``` + +```{toctree} +:maxdepth: 1 +:caption: Protocol Guides + +protocol/README.md +customization_guide/inference_protocols +protocol/extension_binary_data +protocol/extension_classification +protocol/extension_generate +protocol/extension_logging +protocol/extension_model_configuration +protocol/extension_model_repository +protocol/extension_schedule_policy +protocol/extension_sequence +protocol/extension_shared_memory +protocol/extension_statistics +protocol/extension_trace +``` + +```{toctree} +:maxdepth: 1 +:caption: Customization Guide + +customization_guide/build +customization_guide/compose +customization_guide/test +``` + +```{toctree} +:maxdepth: 1 +:caption: Examples + +examples/jetson/README +examples/jetson/concurrency_and_dynamic_batching/README +``` diff --git a/docs/build.md b/docs/customization_guide/build.md similarity index 90% rename from docs/build.md rename to docs/customization_guide/build.md index d64cceb4cc..40f8f00c76 100644 --- a/docs/build.md +++ b/docs/customization_guide/build.md @@ -1,5 +1,5 @@ + +# Secure Deployment Considerations + +The Triton Inference Server project is designed for flexibility and +allows developers to create 
and deploy inferencing solutions in a +variety of ways. Developers can deploy Triton as an http server, a +grpc server, a server supporting both, or embed a Triton server into +their own application. Developers can deploy Triton locally or in the +cloud, within a Kubernetes cluster behind an API gateway or as a +standalone process. This guide is intended to provide some key points +and best practices that users deploying Triton based solutions should +consider. + +| [Deploying Behind a Secure Gateway or Proxy](#deploying-behind-a-secure-proxy-or-gateway) | [Running with Least Privilege](#running-with-least-privilege) | + +> [!IMPORTANT] +> Ultimately the security of a solution based on Triton +> is the responsibility of the developer building and deploying that +> solution. When deploying in production settings please have security +> experts review any potential risks and threats. + +> [!WARNING] +> Dynamic updates to model repositories are disabled by +> default. Enabling dynamic updates to model repositories either +> through model loading APIs or through directory polling can lead to +> arbitrary code execution. Model repository access control is +> critical in production deployments. If dynamic updates are required, +> ensure only trusted entities have access to model loading APIs and +> model repository directories. + +## Deploying Behind a Secure Proxy or Gateway + +The Triton Inference Server is designed primarily as a microservice to +be deployed as part of a larger solution within an application +framework or service mesh. + +In such deployments it is typical to utilize dedicated gateway or +proxy servers to handle authorization, access control, resource +management, encryption, load balancing, redundancy and many other +security and availability features. + +The full design of such systems is outside the scope of this +deployment guide but in such scenarios dedicated ingress controllers +handle access from outside the trusted network while Triton Inference +Server handles only trusted, validated requests. + +In such scenarios Triton Inference Server is not exposed directly to +an untrusted network. + +### References on Secure Deployments + +In the following references, Triton Inference Server would be deployed +as an "Application" or "Service" within the trusted internal network. + +* [https://www.nginx.com/blog/architecting-zero-trust-security-for-kubernetes-apps-with-nginx/] +* [https://istio.io/latest/docs/concepts/security/] +* [https://konghq.com/blog/enterprise/envoy-service-mesh] +* [https://www.solo.io/topics/envoy-proxy/] + +## Running with Least Privilege + + The security principle of least privilege advocates that a process be + granted the minimum permissions required to do its job. + + For an inference solution based on Triton Inference Server there are a + number of ways to reduce security risks by limiting the permissions + and capabilities of the server to the minimum required for correct + operation. + +### 1. Follow Best Practices for Securing Kubernetes Deployments + + When deploying Triton within a Kubernetes pod ensure that it is + running with a service account with the fewest possible + permissions. Ensure that you have configured [role based access + control](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) + to limit access to resources and capabilities as required by your + application. + +### 2. 
Follow Best Practices for Launching Standalone Docker Containers + + When Triton is deployed as a containerized service, standard docker + security practices apply. This includes limiting the resources that a + container has access to as well as limiting network access to the + container. https://docs.docker.com/engine/security/ + +### 3. Run as a Non-Root User + + Triton's pre-built containers contain a non-root user that can be used + to launch the tritonserver application with limited permissions. This + user, `triton-server`, is created with `user id 1000`. When launching + the container using docker the user can be set with the `--user` + command line option. + +##### Example Launch Command + + ``` + docker run --rm --user triton-server -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:YY.MM-py3 tritonserver --model-repository=/models + ``` + +### 4. Restrict or Disable Access to Protocols and APIs + +The pre-built Triton Inference Server application enables a full set +of features including health checks, server metadata, inference APIs, +shared memory APIs, model and model repository configuration, +statistics, tracing and logging. Care should be taken to only expose +those capabilities that are required for your solution. + +#### Disabling Features at Compile Time + +When building a custom inference server application, features can be +selectively enabled or disabled using the `build.py` script. As an +example, a developer can use the flags `--endpoint http` and +`--endpoint grpc` to compile support for `http`, `grpc` or +both. Support for individual backends can be enabled as well. For more +details please see [documentation](build.md) on building a custom +inference server application. + +#### Disabling / Restricting Features at Run Time + +The `tritonserver` application provides a number of command line +options to enable and disable features when launched. For a full list +of options please see `tritonserver --help`. The following subset is +described here with basic recommendations. + +##### `--exit-on-error , default True` + +Exits the inference server if any error occurs during +initialization. Recommended to set to `True` to catch any +unanticipated errors. + +##### `--disable-auto-complete-config, default enabled` + +Disables backends from autocompleting model configuration. If not +required for your solution, it is recommended to disable auto-completion to ensure model +configurations are defined statically. + +##### `--strict-readiness , default True` + +If set to true, `/v2/health/ready` will only report ready when all +selected models are loaded. Recommended to set to `True` to provide a +signal to other services and orchestration frameworks when full +initialization is complete and the server is healthy. + +##### `--model-control-mode , default "none"` + +Specifies the mode for model management. + +> [!WARNING] +> Allowing dynamic updates to the model repository can lead +> to arbitrary code execution. Model repository access control is +> critical in production deployments. Unless required for operation, it's recommended +> to disable dynamic updates. If required, please ensure only trusted entities +> can add or remove models from a model repository. + +Options: + + * `none`- Models are loaded at start up and cannot be modified. + * `poll`- Server process will poll the model repository for changes. + * `explicit` - Models can be loaded and unloaded via the model control APIs. + +Recommended to set to `none` unless dynamic updates are required.
If +dynamic updates are required, care must be taken to control access to +the model repository files and load and unload APIs. + +##### `--allow-http , default True` + +Enable HTTP request handling. Recommended to set to `False` if not required. + +##### `--allow-grpc , default True` + +Enable gRPC request handling. Recommended to set to `False` if not required. + +##### `--grpc-use-ssl default False` + +Use SSL authentication for gRPC requests. Recommended to set to `True` if service is not protected by a gateway or proxy. + +##### `--grpc-use-ssl-mutual default False` + +Use mutual SSL authentication for gRPC requests. Recommended to set to `True` if service is not protected by a gateway or proxy. + +##### `--grpc-restricted-protocol <:=>` + +Restrict access to specific gRPC protocol categories to users with +a specific key, value pair shared secret. See +[limit-endpoint-access](inference_protocols.md#limit-endpoint-access-beta) +for more information. + +> [!Note] +> Restricting access can be used to limit exposure to model +> control APIs to trusted users. + +##### `--http-restricted-api <:=>` + +Restrict access to specific HTTP API categories to users with +a specific key, value pair shared secret. See +[limit-endpoint-access](inference_protocols.md#limit-endpoint-access-beta) +for more information. + +> [!Note] +> Restricting access can be used to limit exposure to model +> control APIs to trusted users. + +##### `--allow-sagemaker default False` + +Enable SageMaker request handling. Recommended to set to `False` unless required. + +##### `--allow-vertex-ai default depends on environment variable` + +Enable Vertex AI request handling. Default is `True` if +`AIP_MODE=PREDICTION`, `False` otherwise. Recommended to set to +`False` unless required. + +##### `--allow-metrics default True` + +Allow the server to publish Prometheus-style metrics. Recommended to set +to `False` if not required to avoid capturing or exposing any sensitive information. + +#### `--trace-config level= default "off"` + +Tracing mode. Trace mode supports `triton` and `opentelemetry`. Unless required, `--trace-config level=off` should be set to avoid capturing or exposing any sensitive information. + + +##### `backend-directory default /opt/tritonserver/backends` + +Directory where backend shared libraries are found. + +> [!Warning] +> Access to add or remove files from the backend directory +> must be access controlled. Adding untrusted files +> can lead to arbitrary code execution. + +##### `repoagent-directory default /opt/tritonserver/repoagents` +Directory where repository agent shared libraries are found. + +> [!Warning] +> Access to add or remove files from the repoagent directory +> must be access controlled. Adding untrusted files +> can lead to arbitrary code execution. + +##### `cache-directory default /opt/tritonserver/caches` + +Directory where cache shared libraries are found. + +> [!Warning] +> Access to add or remove files from the cache directory +> must be access controlled. Adding untrusted files +> can lead to arbitrary code execution.
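+Putting several of these recommendations together, the following is a minimal
+sketch of a locked-down launch command for an HTTP-only deployment. The exact
+flag spellings and defaults should be confirmed against `tritonserver --help`
+for the version being deployed:
+
+```
+tritonserver --model-repository=/models \
+             --model-control-mode=none \
+             --exit-on-error=true \
+             --strict-readiness=true \
+             --allow-grpc=false \
+             --allow-sagemaker=false \
+             --allow-vertex-ai=false \
+             --allow-metrics=false \
+             --trace-config level=off
+```
+
+Combined with a read-only model repository and the non-root `triton-server`
+user described above, this limits the server to serving inference for the
+models loaded at startup.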
+ + + + + diff --git a/docs/inference_protocols.md b/docs/customization_guide/inference_protocols.md similarity index 65% rename from docs/inference_protocols.md rename to docs/customization_guide/inference_protocols.md index 350fb78b41..592f26e7d1 100644 --- a/docs/inference_protocols.md +++ b/docs/customization_guide/inference_protocols.md @@ -1,5 +1,5 @@ + +# Triton Examples + +**New to Triton Inference Server?** Make use of [these tutorials](https://github.com/triton-inference-server/tutorials) to begin your Triton journey! + +This folder contains the following: +* jetson: This covers deploying Triton Inference Server on Jetson devices. +* model_repository: This folder is a basic model repository for deploying models using the Triton Inference Server. \ No newline at end of file diff --git a/docs/examples/jetson/README.md b/docs/examples/jetson/README.md index fcd28e6c59..f149acbca4 100644 --- a/docs/examples/jetson/README.md +++ b/docs/examples/jetson/README.md @@ -1,5 +1,5 @@ + +::::{grid} +:reverse: +:gutter: 2 1 1 1 +:margin: 4 4 1 1 + +:::{grid-item} +:columns: 4 + +```{image} ./_static/nvidia-logo-vert-rgb-blk-for-screen.png +:width: 300px +``` +::: +:::{grid-item} +:columns: 8 +:class: sd-fs-3 + +NVIDIA Triton Inference Server + +::: +:::: + +Triton Inference Server is an open source inference serving software that streamlines AI inferencing. + + + +
+ +
+ +# Triton Inference Server + +Triton Inference Server enables teams to deploy any AI model from multiple deep +learning and machine learning frameworks, including TensorRT, TensorFlow, +PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton supports inference +across cloud, data center, edge and embedded devices on NVIDIA GPUs, x86 and ARM +CPU, or AWS Inferentia. Triton Inference Server delivers optimized performance +for many query types, including real time, batched, ensembles and audio/video +streaming. Triton inference Server is part of +[NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/), +a software platform that accelerates the data science pipeline and streamlines +the development and deployment of production AI. + +Major features include: + +- [Supports multiple deep learning + frameworks](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton) +- [Supports multiple machine learning + frameworks](https://github.com/triton-inference-server/fil_backend) +- [Concurrent model + execution](user_guide/architecture.md#concurrent-model-execution) +- [Dynamic batching](user_guide/model_configuration.md#dynamic-batcher) +- [Sequence batching](user_guide/model_configuration.md#sequence-batcher) and + [implicit state management](user_guide/architecture.md#implicit-state-management) + for stateful models +- Provides [Backend API](https://github.com/triton-inference-server/backend) that + allows adding custom backends and pre/post processing operations +- Model pipelines using + [Ensembling](user_guide/architecture.md#ensemble-models) or [Business + Logic Scripting + (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting) +- [HTTP/REST and GRPC inference + protocols](customization_guide/inference_protocols.md) based on the community + developed [KServe + protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) +- A [C API](customization_guide/inference_protocols.md#in-process-triton-server-api) and + [Java API](customization_guide/inference_protocols.md#java-bindings-for-in-process-triton-server-api) + allow Triton to link directly into your application for edge and other in-process use cases +- [Metrics](user_guide/metrics.md) indicating GPU utilization, server + throughput, server latency, and more + +Join the [Triton and TensorRT community](https://www.nvidia.com/en-us/deep-learning-ai/triton-tensorrt-newsletter/) and stay current on the latest product updates, bug fixes, content, best +practices, and more. Need enterprise support? NVIDIA global support is available +for Triton Inference Server with the [NVIDIA AI Enterprise software suite](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/). + +See the [Latest Release Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-23-05.html#rel-23-05) for updates on the newest features and bug fixes. diff --git a/docs/metrics.md b/docs/metrics.md deleted file mode 100644 index 6f9a15f918..0000000000 --- a/docs/metrics.md +++ /dev/null @@ -1,143 +0,0 @@ - - -# Metrics - -Triton provides [Prometheus](https://prometheus.io/) metrics -indicating GPU and request statistics. By default, these metrics are -available at http://localhost:8002/metrics. The metrics are only -available by accessing the endpoint, and are not pushed or published -to any remote server. 
The metric format is plain text so you can view -them directly, for example: - -``` -$ curl localhost:8002/metrics -``` - -The tritonserver --allow-metrics=false option can be used to disable -all metric reporting and --allow-gpu-metrics=false can be used to -disable just the GPU Utilization and GPU Memory metrics. The ---metrics-port option can be used to select a different port. For now, -Triton reuses http address for metrics endpoint. The option --http-address -can be used to bind http and metrics endpoints to the same specific address -when http service is enabled. - -The following table describes the available metrics. - -|Category |Metric |Description |Granularity|Frequency | -|--------------|----------------|---------------------------------------|-----------|-------------| -|GPU Utilization |Power Usage |GPU instantaneous power |Per GPU |Per second | -| |Power Limit |Maximum GPU power limit |Per GPU |Per second | -| |Energy Consumption|GPU energy consumption in joules since Triton started|Per GPU|Per second| -| |GPU Utilization |GPU utilization rate (0.0 - 1.0) |Per GPU |Per second | -|GPU Memory |GPU Total Memory|Total GPU memory, in bytes |Per GPU |Per second | -| |GPU Used Memory |Used GPU memory, in bytes |Per GPU |Per second | -|Count |Success Count |Number of successful inference requests received by Triton (each request is counted as 1, even if the request contains a batch) |Per model |Per request | -| |Failure Count |Number of failed inference requests received by Triton (each request is counted as 1, even if the request contains a batch) |Per model |Per request | -| |Inference Count |Number of inferences performed (a batch of "n" is counted as "n" inferences, does not include cached requests)|Per model|Per request| -| |Execution Count |Number of inference batch executions (see [Count Metrics](#count-metrics), does not include cached requests)|Per model|Per request| -|Latency |Request Time |Cumulative end-to-end inference request handling time (includes cached requests) |Per model |Per request | -| |Queue Time |Cumulative time requests spend waiting in the scheduling queue (includes cached requests) |Per model |Per request | -| |Compute Input Time|Cumulative time requests spend processing inference inputs (in the framework backend, does not include cached requests) |Per model |Per request | -| |Compute Time |Cumulative time requests spend executing the inference model (in the framework backend, does not include cached requests) |Per model |Per request | -| |Compute Output Time|Cumulative time requests spend processing inference outputs (in the framework backend, does not include cached requests) |Per model |Per request | -|Response Cache|Total Cache Entry Count |Total number of responses stored in response cache across all models |Server-wide |Per second | -| |Total Cache Lookup Count |Total number of response cache lookups done by Triton across all models |Server-wide |Per second | -| |Total Cache Hit Count |Total number of response cache hits across all models |Server-wide |Per second | -| |Total Cache Miss Count |Total number of response cache misses across all models |Server-wide |Per second | -| |Total Cache Eviction Count |Total number of response cache evictions across all models |Server-wide |Per second | -| |Total Cache Lookup Time |Cumulative time requests spend checking for a cached response across all models (microseconds) |Server-wide |Per second | -| |Total Cache Utilization |Total Response Cache utilization rate (0.0 - 1.0) |Server-wide |Per second | -| 
|Cache Hit Count |Number of response cache hits per model |Per model |Per request | -| |Cache Hit Lookup Time |Cumulative time requests spend retrieving a cached response per model on cache hits (microseconds) |Per model |Per request | -| |Cache Miss Count |Number of response cache misses per model |Per model |Per request | -| |Cache Miss Lookup Time |Cumulative time requests spend looking up a request hash on a cache miss (microseconds) |Per model |Per request | -| |Cache Miss Insertion Time |Cumulative time requests spend inserting responses into the cache on a cache miss (microseconds) |Per model |Per request | - - -## Response Cache - -Compute latency metrics in the table above are calculated for the -time spent in model inference backends. If the response cache is enabled for a -given model (see [Response Cache](https://github.com/triton-inference-server/server/blob/main/docs/response_cache.md) -docs for more info), total inference times may be affected by response cache -lookup times. - -On cache hits, "Cache Hit Lookup Time" indicates the time spent looking up the -response, and "Compute Input Time" / "Compute Time" / "Compute Output Time" -are not recorded. - -On cache misses, "Cache Miss Lookup Time" indicates the time spent looking up -the request hash and "Cache Miss Insertion Time" indicates the time spent -inserting the computed output tensor data into the cache. Otherwise, "Compute -Input Time" / "Compute Time" / "Compute Output Time" will be recorded as usual. - -## Count Metrics - -For models that do not support batching, *Request Count*, *Inference -Count* and *Execution Count* will be equal, indicating that each -inference request is executed separately. - -For models that support batching, the count metrics can be interpreted -to determine average batch size as *Inference Count* / *Execution -Count*. The count metrics are illustrated by the following examples: - -* Client sends a single batch-1 inference request. *Request Count* = - 1, *Inference Count* = 1, *Execution Count* = 1. - -* Client sends a single batch-8 inference request. *Request Count* = - 1, *Inference Count* = 8, *Execution Count* = 1. - -* Client sends 2 requests: batch-1 and batch-8. Dynamic batcher is not - enabled for the model. *Request Count* = 2, *Inference Count* = 9, - *Execution Count* = 2. - -* Client sends 2 requests: batch-1 and batch-1. Dynamic batcher is - enabled for the model and the 2 requests are dynamically batched by - the server. *Request Count* = 2, *Inference Count* = 2, *Execution - Count* = 1. - -* Client sends 2 requests: batch-1 and batch-8. Dynamic batcher is - enabled for the model and the 2 requests are dynamically batched by - the server. *Request Count* = 2, *Inference Count* = 9, *Execution - Count* = 1. - -## Custom Metrics - -Triton exposes a C API to allow users and backends to register and collect -custom metrics with the existing Triton metrics endpoint. The user takes the -ownership of the custom metrics created through the APIs and must manage their -lifetime following the API documentation. - -The -[identity_backend](https://github.com/triton-inference-server/identity_backend/blob/main/README.md#custom-metric-example) -demonstrates a practical example of adding a custom metric to a backend. - -Further documentation can be found in the `TRITONSERVER_MetricFamily*` and -`TRITONSERVER_Metric*` API annotations in -[tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). 
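Because custom metrics are reported through the same endpoint as Triton's
built-in metrics, they can be checked with the same scrape shown earlier. The
metric family name below is purely hypothetical and stands in for whatever name
was registered through the C API:

```
# Scrape the metrics endpoint and filter for a (hypothetical) custom metric family
curl -s localhost:8002/metrics | grep my_custom_counter
```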
diff --git a/docs/perf_analyzer.md b/docs/perf_analyzer.md deleted file mode 100644 index 5decc4c5e4..0000000000 --- a/docs/perf_analyzer.md +++ /dev/null @@ -1,667 +0,0 @@ - - -# Performance Analyzer - -A critical part of optimizing the inference performance of your model -is being able to measure changes in performance as you experiment with -different optimization strategies. The perf_analyzer application -(previously known as perf_client) performs this task for the Triton -Inference Server. The perf_analyzer is included with the client -examples which are [available from several -sources](https://github.com/triton-inference-server/client#getting-the-client-libraries-and-examples). - -The perf_analyzer application generates inference requests to your -model and measures the throughput and latency of those requests. To -get representative results, perf_analyzer measures the throughput and -latency over a time window, and then repeats the measurements until it -gets stable values. By default perf_analyzer uses average latency to -determine stability but you can use the --percentile flag to stabilize -results based on that confidence level. For example, if ---percentile=95 is used the results will be stabilized using the 95-th -percentile request latency. For example, - -``` -$ perf_analyzer -m inception_graphdef --percentile=95 -*** Measurement Settings *** - Batch size: 1 - Measurement window: 5000 msec - Using synchronous calls for inference - Stabilizing using p95 latency - -Request concurrency: 1 - Client: - Request count: 348 - Throughput: 69.6 infer/sec - p50 latency: 13936 usec - p90 latency: 18682 usec - p95 latency: 19673 usec - p99 latency: 21859 usec - Avg HTTP time: 14017 usec (send/recv 200 usec + response wait 13817 usec) - Server: - Inference count: 428 - Execution count: 428 - Successful request count: 428 - Avg request latency: 12005 usec (overhead 36 usec + queue 42 usec + compute input 164 usec + compute infer 11748 usec + compute output 15 usec) - -Inferences/Second vs. Client p95 Batch Latency -Concurrency: 1, throughput: 69.6 infer/sec, latency 19673 usec -``` - -## Request Concurrency - -By default perf_analyzer measures your model's latency and throughput -using the lowest possible load on the model. To do this perf_analyzer -sends one inference request to Triton and waits for the response. -When that response is received, the perf_analyzer immediately sends -another request, and then repeats this process during the measurement -windows. The number of outstanding inference requests is referred to -as the *request concurrency*, and so by default perf_analyzer uses a -request concurrency of 1. - -Using the --concurrency-range \:\:\ option you can have -perf_analyzer collect data for a range of request concurrency -levels. Use the --help option to see complete documentation for this -and other options. For example, to see the latency and throughput of -your model for request concurrency values from 1 to 4: - -``` -$ perf_analyzer -m inception_graphdef --concurrency-range 1:4 -*** Measurement Settings *** - Batch size: 1 - Measurement window: 5000 msec - Latency limit: 0 msec - Concurrency limit: 4 concurrent requests - Using synchronous calls for inference - Stabilizing using average latency - -Request concurrency: 1 - Client: - Request count: 339 - Throughput: 67.8 infer/sec - Avg latency: 14710 usec (standard deviation 2539 usec) - p50 latency: 13665 usec -... 
-Request concurrency: 4 - Client: - Request count: 415 - Throughput: 83 infer/sec - Avg latency: 48064 usec (standard deviation 6412 usec) - p50 latency: 47975 usec - p90 latency: 56670 usec - p95 latency: 59118 usec - p99 latency: 63609 usec - Avg HTTP time: 48166 usec (send/recv 264 usec + response wait 47902 usec) - Server: - Inference count: 498 - Execution count: 498 - Successful request count: 498 - Avg request latency: 45602 usec (overhead 39 usec + queue 33577 usec + compute input 217 usec + compute infer 11753 usec + compute output 16 usec) - -Inferences/Second vs. Client Average Batch Latency -Concurrency: 1, throughput: 67.8 infer/sec, latency 14710 usec -Concurrency: 2, throughput: 89.8 infer/sec, latency 22280 usec -Concurrency: 3, throughput: 80.4 infer/sec, latency 37283 usec -Concurrency: 4, throughput: 83 infer/sec, latency 48064 usec -``` - -## Understanding The Output - -For each request concurrency level perf_analyzer reports latency and -throughput as seen from the *client* (that is, as seen by -perf_analyzer) and also the average request latency on the server. - -The server latency measures the total time from when the request is -received at the server until the response is sent from the -server. Because of the HTTP and GRPC libraries used to implement the -server endpoints, total server latency is typically more accurate for -HTTP requests as it measures time from first byte received until last -byte sent. For both HTTP and GRPC the total server latency is -broken-down into the following components: - -- *queue*: The average time spent in the inference schedule queue by a - request waiting for an instance of the model to become available. -- *compute*: The average time spent performing the actual inference, - including any time needed to copy data to/from the GPU. - -The client latency time is broken-down further for HTTP and GRPC as -follows: - -- HTTP: *send/recv* indicates the time on the client spent sending the - request and receiving the response. *response wait* indicates time - waiting for the response from the server. -- GRPC: *(un)marshal request/response* indicates the time spent - marshalling the request data into the GRPC protobuf and - unmarshalling the response data from the GRPC protobuf. *response - wait* indicates time writing the GRPC request to the network, - waiting for the response, and reading the GRPC response from the - network. - -Use the verbose (-v) option to perf_analyzer to see more output, -including the stabilization passes run for each request concurrency -level. - -## Visualizing Latency vs. Throughput - -The perf_analyzer provides the -f option to generate a file containing -CSV output of the results. - -``` -$ perf_analyzer -m inception_graphdef --concurrency-range 1:4 -f perf.csv -$ cat perf.csv -Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency -1,69.2,225,2148,64,206,11781,19,0,13891,18795,19753,21018 -3,84.2,237,1768,21673,209,11742,17,0,35398,43984,47085,51701 -4,84.2,279,1604,33669,233,11731,18,1,47045,56545,59225,64886 -2,87.2,235,1973,9151,190,11346,17,0,21874,28557,29768,34766 -``` - -NOTE: The rows in the CSV file are sorted in an increasing order of throughput (Inferences/Second). - -You can import the CSV file into a spreadsheet to help visualize -the latency vs inferences/second tradeoff as well as see some -components of the latency. 
Follow these steps: - -- Open [this - spreadsheet](https://docs.google.com/spreadsheets/d/1S8h0bWBBElHUoLd2SOvQPzZzRiQ55xjyqodm_9ireiw) -- Make a copy from the File menu "Make a copy..." -- Open the copy -- Select the A1 cell on the "Raw Data" tab -- From the File menu select "Import..." -- Select "Upload" and upload the file -- Select "Replace data at selected cell" and then select the "Import data" button - -## Input Data - -Use the --help option to see complete documentation for all input -data options. By default perf_analyzer sends random data to all the -inputs of your model. You can select a different input data mode with -the --input-data option: - -- *random*: (default) Send random data for each input. -- *zero*: Send zeros for each input. -- directory path: A path to a directory containing a binary file for each input, named the same as the input. Each binary file must contain the data required for that input for a batch-1 request. Each file should contain the raw binary representation of the input in row-major order. -- file path: A path to a JSON file containing data to be used with every inference request. See the "Real Input Data" section for further details. --input-data can be provided multiple times with different file paths to specific multiple JSON files. - -For tensors with with STRING/BYTES datatype there are additional -options --string-length and --string-data that may be used in some -cases (see --help for full documentation). - -For models that support batching you can use the -b option to indicate -the batch-size of the requests that perf_analyzer should send. For -models with variable-sized inputs you must provide the --shape -argument so that perf_analyzer knows what shape tensors to use. For -example, for a model that has an input called *IMAGE* that has shape [ -3, N, M ], where N and M are variable-size dimensions, to tell -perf_analyzer to send batch-size 4 requests of shape [ 3, 224, 224 ]: - -``` -$ perf_analyzer -m mymodel -b 4 --shape IMAGE:3,224,224 -``` - -## Real Input Data - -The performance of some models is highly dependent on the data used. -For such cases you can provide data to be used with every inference -request made by analyzer in a JSON file. The perf_analyzer will use -the provided data in a round-robin order when sending inference -requests. - -Each entry in the "data" array must specify all input tensors with the -exact size expected by the model from a single batch. The following -example describes data for a model with inputs named, INPUT0 and -INPUT1, shape [4, 4] and data type INT32: - -``` - { - "data" : - [ - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - }, - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - }, - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - }, - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - ... - ] - } -``` - -Note that the [4, 4] tensor has been flattened in a row-major format -for the inputs. In addition to specifying explicit tensors, you can -also provide Base64 encoded binary data for the tensors. Each data -object must list its data in a row-major order. Binary data must be in -little-endian byte order. 
The following example highlights how this -can be acheived: - -``` - { - "data" : - [ - { - "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="}, - "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="} - }, - { - "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="}, - "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="} - }, - { - "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="}, - "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="} - }, - ... - ] - } -``` - -In case of sequence models, multiple data streams can be specified in -the JSON file. Each sequence will get a data stream of its own and the -analyzer will ensure the data from each stream is played back to the -same correlation id. The below example highlights how to specify data -for multiple streams for a sequence model with a single input named -INPUT, shape [1] and data type STRING: - -``` - { - "data" : - [ - [ - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["2"] - }, - { - "INPUT" : ["3"] - }, - { - "INPUT" : ["4"] - } - ], - [ - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["1"] - } - ], - [ - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["1"] - } - ] - ] - } -``` - -The above example describes three data streams with lengths 4, 3 and 2 -respectively. The perf_analyzer will hence produce sequences of -length 4, 3 and 2 in this case. - -You can also provide an optional "shape" field to the tensors. This is -especially useful while profiling the models with variable-sized -tensors as input. Additionally note that when providing the "shape" field, -tensor contents must be provided separately in "content" field in row-major -order. The specified shape values will override default input shapes -provided as a command line option (see --shape) for variable-sized inputs. -In the absence of "shape" field, the provided defaults will be used. There -is no need to specify shape as a command line option if all the data steps -provide shape values for variable tensors. Below is an example json file -for a model with single input "INPUT", shape [-1,-1] and data type INT32: - -``` - { - "data" : - [ - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "shape": [2,8] - } - }, - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "shape": [8,2] - } - }, - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - }, - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "shape": [4,4] - } - } - ... - ] - } -``` - -The following is the example to provide contents as base64 string with explicit shapes: - -``` -{ - "data": [{ - "INPUT": { - "content": {"b64": "/9j/4AAQSkZ(...)"}, - "shape": [7964] - }}, - (...)] -} -``` - -### Output Validation - -When real input data is provided, it is optional to request perf analyzer to -validate the inference output for the input data. - -Validation output can be specified in "validation_data" field in the same format -as "data" field for real input. Note that the entries in "validation_data" must -align with "data" for proper mapping. The following example describes validation -data for a model with inputs named, INPUT0 and INPUT1, outputs named, OUTPUT0 -and OUTPUT1, all tensors have shape [4, 4] and data type INT32: - -``` - { - "data" : - [ - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - ... 
- ], - "validation_data" : - [ - { - "OUTPUT0" : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - "OUTPUT1" : [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] - } - ... - ] - } -``` - -Besides the above example, the validation outputs can be specified in the same -variations described in "real input data" section. - -## Shared Memory - -By default perf_analyzer sends input tensor data and receives output -tensor data over the network. You can instead instruct perf_analyzer to -use system shared memory or CUDA shared memory to communicate tensor -data. By using these options you can model the performance that you -can achieve by using shared memory in your application. Use ---shared-memory=system to use system (CPU) shared memory or ---shared-memory=cuda to use CUDA shared memory. - -## Communication Protocol - -By default perf_analyzer uses HTTP to communicate with Triton. The GRPC -protocol can be specificed with the -i option. If GRPC is selected the ---streaming option can also be specified for GRPC streaming. - -### SSL/TLS Support - -perf_analyzer can be used to benchmark Triton service behind SSL/TLS-enabled endpoints. These options can help in establishing secure connection with the endpoint and profile the server. - -For gRPC, see the following options: - -* `--ssl-grpc-use-ssl` -* `--ssl-grpc-root-certifications-file` -* `--ssl-grpc-private-key-file` -* `--ssl-grpc-certificate-chain-file` - -More details here: https://grpc.github.io/grpc/cpp/structgrpc_1_1_ssl_credentials_options.html - -The [inference protocol gRPC SSL/TLS section](inference_protocols.md#ssltls) describes server-side options to configure SSL/TLS in Triton's gRPC endpoint. - -For HTTPS, the following options are exposed: - -* `--ssl-https-verify-peer` -* `--ssl-https-verify-host` -* `--ssl-https-ca-certificates-file` -* `--ssl-https-client-certificate-file` -* `--ssl-https-client-certificate-type` -* `--ssl-https-private-key-file` -* `--ssl-https-private-key-type` - -See `--help` for full documentation. - -Unlike gRPC, Triton's HTTP server endpoint can not be configured with SSL/TLS support. - -Note: Just providing these `--ssl-http-*` options to perf_analyzer does not ensure the SSL/TLS is used in communication. If SSL/TLS is not enabled on the service endpoint, these options have no effect. The intent of exposing these options to a user of perf_analyzer is to allow them to configure perf_analyzer to benchmark Triton service behind SSL/TLS-enabled endpoints. In other words, if Triton is running behind a HTTPS server proxy, then these options would allow perf_analyzer to profile Triton via exposed HTTPS proxy. - -## Benchmarking Triton directly via C API - -Besides using HTTP or gRPC server endpoints to communicate with Triton, perf_analyzer also allows user to benchmark Triton directly using C API. HTTP/gRPC endpoints introduce an additional latency in the pipeline which may not be of interest to the user who is using Triton via C API within their application. Specifically, this feature is useful to benchmark bare minimum Triton without additional overheads from HTTP/gRPC communication. - -### Prerequisite -Pull the Triton SDK and the Inference Server container images on target machine. -Since you will need access to the Tritonserver install, it might be easier if -you copy the perf_analyzer binary to the Inference Server container. - -### Required Parameters -Use the --help option to see complete list of supported command line arguments. 
-By default perf_analyzer expects the Triton instance to already be running. You can configure the C API mode using the `--service-kind` option. In additon, you will need to point -perf_analyzer to the Triton server library path using the `--triton-server-directory` option and the model -repository path using the `--model-repository` option. -If the server is run successfully, there is a prompt: "server is alive!" and perf_analyzer will print the stats, as normal. -An example run would look like: -``` -perf_analyzer -m graphdef_int32_int32_int32 --service-kind=triton_c_api --triton-server-directory=/opt/tritonserver --model-repository=/workspace/qa/L0_perf_analyzer_capi/models -``` - -### Non-supported functionalities -There are a few functionalities that are missing from the C API. They are: -1. Async mode (`-a`) -2. Using shared memory mode (`--shared-memory=cuda` or `--shared-memory=system`) -3. Request rate range mode -4. For additonal known non-working cases, please refer to - [qa/L0_perf_analyzer_capi/test.sh](https://github.com/triton-inference-server/server/blob/main/qa/L0_perf_analyzer_capi/test.sh#L239-L277) - - -## Benchmarking TensorFlow Serving -perf_analyzer can also be used to benchmark models deployed on -[TensorFlow Serving](https://github.com/tensorflow/serving) using -the `--service-kind` option. The support is however only available -through gRPC protocol. - -Following invocation demonstrates how to configure perf_analyzer -to issue requests to a running instance of -`tensorflow_model_server`: - -``` -$ perf_analyzer -m resnet50 --service-kind tfserving -i grpc -b 1 -p 5000 -u localhost:8500 -*** Measurement Settings *** - Batch size: 1 - Using "time_windows" mode for stabilization - Measurement window: 5000 msec - Using synchronous calls for inference - Stabilizing using average latency -Request concurrency: 1 - Client: - Request count: 829 - Throughput: 165.8 infer/sec - Avg latency: 6032 usec (standard deviation 569 usec) - p50 latency: 5863 usec - p90 latency: 6655 usec - p95 latency: 6974 usec - p99 latency: 8093 usec - Avg gRPC time: 5984 usec ((un)marshal request/response 257 usec + response wait 5727 usec) -Inferences/Second vs. Client Average Batch Latency -Concurrency: 1, throughput: 165.8 infer/sec, latency 6032 usec -``` - -You might have to specify a different url(`-u`) to access wherever -the server is running. The report of perf_analyzer will only -include statistics measured at the client-side. - -**NOTE:** The support is still in **beta**. perf_analyzer does -not guarantee optimum tuning for TensorFlow Serving. However, a -single benchmarking tool that can be used to stress the inference -servers in an identical manner is important for performance -analysis. - - -The following points are important for interpreting the results: -1. `Concurrent Request Execution`: -TensorFlow Serving (TFS), as of version 2.8.0, by default creates -threads for each request that individually submits requests to -TensorFlow Session. There is a resource limit on the number of -concurrent threads serving requests. When benchmarking at a higher -request concurrency, you can see higher throughput because of this. -Unlike TFS, by default Triton is configured with only a single -[instance count](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#instance-groups). Hence, at a higher request concurrency, most -of the requests are blocked on the instance availability. 
To -configure Triton to behave like TFS, set the instance count to a -reasonably high value and then set -[MAX_SESSION_SHARE_COUNT](https://github.com/triton-inference-server/tensorflow_backend#parameters) -parameter in the model confib.pbtxt to the same value.For some -context, the TFS sets its thread constraint to four times the -num of schedulable CPUs. -2. `Different library versions`: -The version of TensorFlow might differ between Triton and -TensorFlow Serving being benchmarked. Even the versions of cuda -libraries might differ between the two solutions. The performance -of models can be susceptible to the versions of these libraries. -For a single request concurrency, if the compute_infer time -reported by perf_analyzer when benchmarking Triton is as large as -the latency reported by perf_analyzer when benchmarking TFS, then -the performance difference is likely because of the difference in -the software stack and outside the scope of Triton. -3. `CPU Optimization`: -TFS has separate builds for CPU and GPU targets. They have -target-specific optimization. Unlike TFS, Triton has a single build -which is optimized for execution on GPUs. When collecting performance -on CPU models on Triton, try running Triton with the environment -variable `TF_ENABLE_ONEDNN_OPTS=1`. - - -## Benchmarking TorchServe -perf_analyzer can also be used to benchmark -[TorchServe](https://github.com/pytorch/serve) using the -`--service-kind` option. The support is however only available through -HTTP protocol. It also requires input to be provided via JSON file. - -Following invocation demonstrates how to configure perf_analyzer to -issue requests to a running instance of `torchserve` assuming the -location holds `kitten_small.jpg`: - -``` -$ perf_analyzer -m resnet50 --service-kind torchserve -i http -u localhost:8080 -b 1 -p 5000 --input-data data.json - Successfully read data for 1 stream/streams with 1 step/steps. -*** Measurement Settings *** - Batch size: 1 - Using "time_windows" mode for stabilization - Measurement window: 5000 msec - Using synchronous calls for inference - Stabilizing using average latency -Request concurrency: 1 - Client: - Request count: 799 - Throughput: 159.8 infer/sec - Avg latency: 6259 usec (standard deviation 397 usec) - p50 latency: 6305 usec - p90 latency: 6448 usec - p95 latency: 6494 usec - p99 latency: 7158 usec - Avg HTTP time: 6272 usec (send/recv 77 usec + response wait 6195 usec) -Inferences/Second vs. Client Average Batch Latency -Concurrency: 1, throughput: 159.8 infer/sec, latency 6259 usec -``` - -The content of `data.json`: - -``` - { - "data" : - [ - { - "TORCHSERVE_INPUT" : ["kitten_small.jpg"] - } - ] - } -``` - -You might have to specify a different url(`-u`) to access wherever -the server is running. The report of perf_analyzer will only include -statistics measured at the client-side. - -**NOTE:** The support is still in **beta**. perf_analyzer does not -guarantee optimum tuning for TorchServe. However, a single benchmarking -tool that can be used to stress the inference servers in an identical -manner is important for performance analysis. - -## Advantages of using Perf Analyzer over third-party benchmark suites - -Triton Inference Server offers the entire serving solution which -includes [client libraries](https://github.com/triton-inference-server/client) -that are optimized for Triton. -Using third-party benchmark suites like jmeter fails to take advantage of the -optimized libraries. Some of these optimizations includes but are not limited -to: -1. 
Using [binary tensor data extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md) with HTTP requests. -2. Effective re-use of gRPC message allocation in subsequent requests. -3. Avoiding extra memory copy via libcurl interface. - -These optimizations can have a tremendous impact on overall performance. -Using perf_analyzer for benchmarking directly allows a user to access -these optimizations in their study. - -Not only that, perf_analyzer is also very customizable and supports many -Triton features as described in this document. This, along with a detailed -report, allows a user to identify performance bottlenecks and experiment -with different features before deciding upon what works best for them. diff --git a/docs/protocol/README.md b/docs/protocol/README.md index 3ce381c8c8..ddec7fc1d3 100644 --- a/docs/protocol/README.md +++ b/docs/protocol/README.md @@ -1,5 +1,5 @@ + +# Generate Extension + +> [!NOTE] +> The Generate Extension is *provisional* and likely to change in future versions. + +This document describes Triton's generate extension. The generate +extension provides a simple text-oriented endpoint schema for interacting with +large language models (LLMs). The generate endpoint is specific to HTTP/REST +frontend. + +## HTTP/REST + +In all JSON schemas shown in this document, `$number`, `$string`, `$boolean`, +`$object` and `$array` refer to the fundamental JSON types. #optional +indicates an optional JSON field. + +Triton exposes the generate endpoint at the following URLs. The client may use +HTTP POST request to different URLs for different response behavior, the +endpoint will return the generate results on success or an error in the case of +failure. + +``` +POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/generate + +POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/generate_stream +``` + +### generate vs. generate_stream + +Both URLs expect the same request JSON object, and generate the same JSON +response object. However, there are some differences in the format used to +return each: +* `/generate` returns exactly 1 response JSON object with a +`Content-Type` of `application/json` +* `/generate_stream` may return multiple responses based on the inference +results, with a `Content-Type` of `text/event-stream; charset=utf-8`. +These responses will be sent as +[Server-Sent Events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events) +(SSE), where each response will be a "data" chunk in the HTTP +response body. In the case of inference errors, responses will have +an [error JSON object](#generate-response-json-error-object). + * Note that the HTTP response code is set in the first response of the SSE, + so if the first response succeeds but an error occurs in a subsequent + response for the request, it can result in receiving an error object + while the status code shows success (200). Therefore, the user must + always check whether an error object is received when generating + responses through `/generate_stream`. + * If the request fails before inference begins, then a JSON error will + be returned with `Content-Type` of `application/json`, similar to errors + from other endpoints with the status code set to an error. + +### Generate Request JSON Object + +The generate request object, identified as *$generate_request*, is +required in the HTTP body of the POST request. The model name and +(optionally) version must be available in the URL. 
If a version is not +provided, the server may choose a version based on its own policies or +return an error. + + $generate_request = + { + "text_input" : $string, + "parameters" : $parameters #optional + } + +* "text_input" : The text input that the model should generate output from. +* "parameters" : An optional object containing zero or more parameters for this + generate request expressed as key/value pairs. See + [Parameters](#parameters) for more information. + +> [!NOTE] +> Any additional properties in the request object are passed either as +> parameters or tensors based on model specification. + +#### Parameters + +The `$parameters` JSON describes zero or more “name”/”value” pairs, +where the “name” is the name of the parameter and the “value” is a +`$string`, `$number`, or `$boolean`. + + $parameters = + { + $parameter, ... + } + + $parameter = $string : $string | $number | $boolean + +Parameters are model-specific. The user should check with the model +specification to set the parameters. + +#### Example Request + +Below is an example to send generate request with additional model parameters `stream` and `temperature`. + +``` +$ curl -X POST localhost:8000/v2/models/mymodel/generate -d '{"text_input": "client input", "parameters": {"stream": false, "temperature": 0}}' + +POST /v2/models/mymodel/generate HTTP/1.1 +Host: localhost:8000 +Content-Type: application/json +Content-Length: +{ + "text_input": "client input", + "parameters" : + { + "stream": false, + "temperature": 0 + } +} +``` + +### Generate Response JSON Object + +A successful generate request is indicated by a 200 HTTP status code. +The generate response object, identified as `$generate_response`, is returned in +the HTTP body. + + $generate_response = + { + "model_name" : $string, + "model_version" : $string, + "text_output" : $string + } + +* "model_name" : The name of the model used for inference. +* "model_version" : The specific model version used for inference. +* "text_output" : The output of the inference. + +#### Example Response + +``` +200 +{ + "model_name" : "mymodel", + "model_version" : "1", + "text_output" : "model output" +} +``` + +### Generate Response JSON Error Object + +A failed generate request must be indicated by an HTTP error status +(typically 400). The HTTP body must contain the +`$generate_error_response` object. + + $generate_error_response = + { + "error": + } + +* “error” : The descriptive message for the error. + +#### Example Error + +``` +400 +{ + "error" : "error message" +} +``` diff --git a/docs/protocol/extension_logging.md b/docs/protocol/extension_logging.md new file mode 100644 index 0000000000..e30c22b784 --- /dev/null +++ b/docs/protocol/extension_logging.md @@ -0,0 +1,198 @@ + + +# Logging Extension + +This document describes Triton's logging extension. The logging extension enables +the client to configure log settings during a Triton run. Triton reports "logging" +in the extensions field of its Server Metadata. + +## HTTP/REST + +In all JSON schemas shown in this document `$number`, `$string`, `$boolean`, +`$object` and `$array` refer to the fundamental JSON types. #optional +indicates an optional JSON field. + +Triton exposes the logging endpoint at the following URL. The client may use +HTTP GET request to retrieve the current log settings. A HTTP POST request +will modify the log settings, and the endpoint will return the updated log +settings on success or an error in the case of failure. 
+ +``` +GET v2/logging + +POST v2/logging +``` + +### Log Setting Response JSON Object + +A successful log setting request is indicated by a 200 HTTP status +code. The response object, identified as `$log_setting_response`, is +returned in the HTTP body for every successful log setting request. + +``` +$log_setting_response = +{ + $log_setting, ... +} + +$log_setting = $string : $string | $boolean | $number +``` + +Each `$log_setting` JSON describes a “name”/”value” pair, where the “name” is +the `$string` representation of the log setting and the “value” is a `$string`, +`$bool`, or `$number` representation of the setting value. Currently, the +following log settings are defined: + +- "log_file" : a `$string` parameter defining the file where the log outputs will be saved. If an empty string is specified, log outputs will stream to the console. + +- "log_info" : a `$boolean` parameter that controls whether the Triton server logs INFO level messages. + +- "log_warning" : a `$boolean` parameter that controls whether the Triton server logs WARNING level messages. + +- "log_error" : a `$boolean` parameter that controls whether the Triton server logs ERROR level messages. + +- "log_verbose_level" : a `$number` parameter that controls whether the Triton server outputs verbose messages +of varying degrees. This value can be any integer >= 0. If "log_verbose_level" is 0, verbose logging will be disabled, and +no verbose messages will be output by the Triton server. If "log_verbose_level" is 1, level 1 verbose messages will be output +by the Triton server. If "log_verbose_level" is 2, the Triton server will output all verbose messages of +level <= 2, etc. Attempting to set "log_verbose_level" to a number < 0 will result in an error. + +- "log_format" : a `$string` parameter that controls the format of Triton server log messages. There are currently +2 formats: "default" and "ISO8601". + + +### Log Setting Response JSON Error Object + +A failed log setting request will be indicated by an HTTP error status +(typically 400). The HTTP body will contain a `$log_setting_error_response` object. + +``` +$log_setting_error_response = +{ + "error": $string +} +``` + +- “error” : The descriptive message for the error. + +### Log Setting Request JSON Object + +A log setting request is made with a HTTP POST to +the logging endpoint. In the corresponding response, the HTTP body contains the +response JSON. A successful request is indicated by a 200 HTTP status code. + +The request object, identified as `$log_setting_request` must be provided in the HTTP +body. + +``` +$log_setting_request = +{ + $log_setting, ... +} +``` + +When a `$log_setting` JSON is received (defined above), only the specified +settings will be updated. + +### Example Usage +The logging protocol extension can be invoked using the curl library in the following manner (assuming +a Triton server is running at `localhost:8000`): +``` +curl -s -w '\n%{http_code}\n' -d '{"log_verbose_level":1}' -X POST localhost:8000/v2/logging +``` +This command should return a `$log_setting_response` JSON object with the following format: +``` +{"log_file":"","log_info":true,"log_warnings":true,"log_errors":true,"log_verbose_level":1,"log_format":"default"} +200 +``` +Note that the current values for all parameter fields are returned even though `log_verbose_level` +was the only parameter that was modified. 
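The current settings can also be read back without modifying anything by issuing
a GET request to the same endpoint (again assuming a Triton server running at
`localhost:8000`). The response is the same `$log_setting_response` object shown
above:

```
curl -s -w '\n%{http_code}\n' -X GET localhost:8000/v2/logging
```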
+ +## GRPC + +For the logging extension, Triton implements the following API: + +``` +service GRPCInferenceService +{ + … + + // Update and get the log setting of the Triton server. + rpc LogSettings(LogSettingsRequest) + returns (LogSettingsResponse) {} +} +``` + +The Log Setting API returns the latest log settings. Errors are indicated +by the `google.rpc.Status` returned for the request. The OK code +indicates success and other codes indicate failure. The request and +response messages for Log Settings are: + +``` +message LogSettingsRequest +{ + message SettingValue + { + oneof parameter_choice + { + // bool param option + bool bool_param = 1; + + // uint32 param option + uint32 uint32_param = 2; + + // string param option + string string_param = 3; + } + } + // The new setting values to be updated. + // Unspecified settings will remain unchanged. + map settings = 1; +} + +message LogSettingsResponse +{ + message SettingValue + { + oneof parameter_choice + { + // bool param option + bool bool_param = 1; + + // uint32 param option + uint32 uint32_param = 2; + + // string param option + string string_param = 3; + } + } + // The latest log settings values. + map settings = 1; +} +``` diff --git a/docs/protocol/extension_model_configuration.md b/docs/protocol/extension_model_configuration.md index 6e995cf77c..04a2d28fac 100644 --- a/docs/protocol/extension_model_configuration.md +++ b/docs/protocol/extension_model_configuration.md @@ -1,5 +1,5 @@ + +# Parameters Extension + +This document describes Triton's parameters extension. The +parameters extension allows an inference request to provide +custom parameters that cannot be provided as inputs. Because this extension is +supported, Triton reports “parameters” in the extensions field of its +Server Metadata. This extension uses the optional "parameters" +field in the KServe Protocol in +[HTTP](https://kserve.github.io/website/0.10/modelserving/data_plane/v2_protocol/#inference-request-json-object) +and +[GRPC](https://kserve.github.io/website/0.10/modelserving/data_plane/v2_protocol/#parameters). + +The following parameters are reserved for Triton's usage and should not be +used as custom parameters: + +- sequence_id +- priority +- timeout +- sequence_start +- sequence_end +- headers +- All the keys that start with `"triton_"` prefix. Some examples used today: + - `"triton_enable_empty_final_response"` request parameter + - `"triton_final_response"` response parameter + +When using both GRPC and HTTP endpoints, you need to make sure to not use +the reserved parameters list to avoid unexpected behavior. The reserved +parameters are not accessible in the Triton C-API. + +## HTTP/REST + +The following example shows how a request can include custom parameters. + +``` +POST /v2/models/mymodel/infer HTTP/1.1 +Host: localhost:8000 +Content-Type: application/json +Content-Length: +{ + "parameters" : { "my_custom_parameter" : 42 } + "inputs" : [ + { + "name" : "input0", + "shape" : [ 2, 2 ], + "datatype" : "UINT32", + "data" : [ 1, 2, 3, 4 ] + } + ], + "outputs" : [ + { + "name" : "output0", + } + ] +} +``` + +## GRPC + +The `parameters` field in the +ModelInferRequest message can be used to send custom parameters. + +## Forwarding HTTP/GRPC Headers as Parameters + +Triton can forward HTTP/GRPC headers as inference request parameters. By +specifying a regular expression in `--http-header-forward-pattern` and +`--grpc-header-forward-pattern`, +Triton will add the headers that match with the regular expression as request +parameters. 
All the forwarded headers will be added as a parameter with string +value. For example to forward all the headers that start with 'PREFIX_' from +both HTTP and GRPC, you should add `--http-header-forward-pattern PREFIX_.* +--grpc-header-forward-pattern PREFIX_.*` to your `tritonserver` command. + +The forwarded headers can be accessed using the +[Python](https://github.com/triton-inference-server/python_backend#inference-request-parameters) +or C Backend APIs as inference request parameters. + diff --git a/docs/protocol/extension_schedule_policy.md b/docs/protocol/extension_schedule_policy.md index a49a97a3de..c3c57a63c7 100644 --- a/docs/protocol/extension_schedule_policy.md +++ b/docs/protocol/extension_schedule_policy.md @@ -1,5 +1,5 @@ - -# Triton Response Cache (beta) - -**This feature is currently in beta and may be subject to change.** - -In this document an *inference request* is the model name, model version, and -input tensors (name, shape, datatype and tensor data) that make up a request -submitted to Triton. An inference result is the output tensors (name, shape, -datatype and tensor data) produced by an inference execution. The response cache -is used by Triton to hold inference results generated for previous executed -inference requests. Triton will maintain the response cache so that inference -requests that hit in the cache will not need to execute a model to produce -results and will instead extract their results from the cache. For some use -cases this can significantly reduce the inference request latency. - -The response cache is enabled by setting a non-zero size when Triton is launched -using the `--response-cache-byte-size` flag. The flag defaults to 0 (zero). When -non-zero, Triton allocates the requested size in CPU memory and **shares the -cache across all inference requests and across all models**. For a given model -to use response caching, the model must enable response caching in the model -configuration. **By default, no model uses response caching even if the response -cache is enabled with the `--response-cache-byte-size` flag.** For more -information on enabling the response cache for each model, see the [model -configuration -docs](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#response-cache). - -Triton accesses the response cache with a hash of the inference request that -includes the model name, model version and model inputs. If the hash is found in -the cache, the corresponding inference result is extracted from the cache and -used for the request. When this happens there is no need for Triton to execute -the model to produce the inference result. If the hash is not found in the -cache, Triton executes the model to produce the inference result, and then -records that result in the cache so that subsequent inference requests can -(re)use those results. - -The response cache is a fixed-size resource, as a result it must be managed by a -replacement policy when the number of cacheable responses exceeds the capacity -of the cache. Currently, the cache only implements a least-recently-used -([LRU](https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU))) -replacement policy which will automatically evict one or more LRU entries to -make room for new entries. - -## Known Limitations - -- Only input tensors located in CPU memory will be hashable for accessing the - cache. 
If an inference request contains input tensors not in CPU memory, the - request will not be hashed and therefore the response will not be cached. -- Only responses with all output tensors located in CPU memory will be eligible - for caching. If any output tensor in a response is not located in CPU memory, - the response will not be cached. -- The cache is accessed using only the inference request hash. As a result, if - two different inference requests generate the same hash (a hash collision), - then Triton may incorrectly use the cached result for an inference request. - The hash is a 64-bit value so the likelihood of collision is small. -- Only successful inference requests will have their responses cached. If a - request fails or returns an error during inference, its response will not be - cached. -- Only requests going through the Default Scheduler or Dynamic Batch Scheduler - are eligible for caching. The Sequence Batcher does not currently support - response caching. diff --git a/docs/trace.md b/docs/trace.md deleted file mode 100644 index 4725925dee..0000000000 --- a/docs/trace.md +++ /dev/null @@ -1,305 +0,0 @@ - - -# Triton Server Trace - -Triton includes that capability to generate a detailed trace for -individual inference requests. Tracing is enable by command-line -arguments when running the tritonserver executable. For example, - -``` -$ tritonserver --trace-file=/tmp/trace.json --trace-rate=100 --trace-level=TIMESTAMPS ... -``` - -The --trace-file option indicates where the trace output should be -written. The --trace-rate option specifies the sampling rate. In -this example every 100-th inference request will be traced. The ---trace-level option indicates the level of trace detail that should -be collected. --trace-level option may be specified multiple times to -trace multiple informations. Use the --help option to get more information. - -In addition to configure trace settings in command line arguments, The user may -modify the trace setting when Triton server -is running via the trace APIs, more information can be found in [trace -protocol](protocol/extension_trace.md). - -## Supported Trace Level Option - -- `TIMESTAMPS`: Tracing execution timestamps of each request. -- `TENSORS`: Tracing input and output tensors during the execution. - -## JSON Trace Output - -The trace output is a JSON file with the following schema. - -``` -[ - { - "model_name": $string, - "model_version": $number, - "id": $number - "parent_id": $number, - "timestamps": [ - { "name" : $string, "ns" : $number }, - ... - ] - }, - { - "model_name": $string, - "model_version": $number, - "id": $number - "activity": $string, - "tensor":{ - "name": $string, - "data": $string, - "dtype": $string - } - }, - ... -] -``` - -Each trace is assigned a "id", which indicates the model name and -version of the inference request. If the trace is from a -model run as part of an ensemble, the "parent_id" will indicate the -"id" of the containing ensemble. - -Each `TIMESTAMPS` trace will have one or more "timestamps" with -each timestamp having a name and the timestamp in nanoseconds ("ns"). 
-For example: - -``` -[ - { - "model_name": "simple", - "model_version": -1, - "id": 1, - "timestamps" : [ - { "name": "http recv start", "ns": 2259961222771924 }, - { "name": "http recv end", "ns": 2259961222820985 }, - { "name": "request handler start", "ns": 2259961223164078 }, - { "name": "queue start", "ns": 2259961223182400 }, - { "name": "compute start", "ns": 2259961223232405 }, - { "name": "compute end", "ns": 2259961230206777 }, - { "name": "request handler end", "ns": 2259961230211887 }, - { "name": "http send start", "ns": 2259961230529606 }, - { "name": "http send end", "ns": 2259961230543930 } - ] - } -] -``` - -Each `TENSORS` trace will contain an "activity" and a "tensor". -"activity" indicates the type of tensor, including "TENSOR_QUEUE_INPUT" -and "TENSOR_BACKEND_OUTPUT" by now. "tensor" has the detail of tensor, -including its "name", "data" and "dtype". For example: - -``` -[ - { - "model_name": "simple", - "model_version": -1, - "id": 1, - "activity": "TENSOR_QUEUE_INPUT", - "tensor":{ - "name": "input", - "data": "0.1,0.1,0.1,...", - "dtype": "FP32" - } - } -] -``` - -## Trace Summary Tool - -An example [trace summary tool](../qa/common/trace_summary.py) can be -used to summarize a set of traces collected from Triton. Basic usage -is: - -``` -$ trace_summary.py -``` - -This produces a summary report for all traces in the file. HTTP and -GRPC inference requests are reported separately. - -``` -File: trace.json -Summary for simple (-1): trace count = 1 -HTTP infer request (avg): 378us - Receive (avg): 21us - Send (avg): 7us - Overhead (avg): 79us - Handler (avg): 269us - Overhead (avg): 11us - Queue (avg): 15us - Compute (avg): 242us - Input (avg): 18us - Infer (avg): 208us - Output (avg): 15us -Summary for simple (-1): trace count = 1 -GRPC infer request (avg): 21441us - Wait/Read (avg): 20923us - Send (avg): 74us - Overhead (avg): 46us - Handler (avg): 395us - Overhead (avg): 16us - Queue (avg): 47us - Compute (avg): 331us - Input (avg): 30us - Infer (avg): 286us - Output (avg): 14us -``` - -Use the -t option to get a summary for each trace in the file. This -summary shows the time, in microseconds, between different points in -the processing of an inference request. For example, the below output -shows that it took 15us from the start of handling the request until -the request was enqueued in the scheduling queue. - -``` -$ trace_summary.py -t -... -simple (-1): - grpc wait/read start - 26529us - grpc wait/read end - 39us - request handler start - 15us - queue start - 20us - compute start - 266us - compute end - 4us - request handler end - 19us - grpc send start - 77us - grpc send end -... -``` - -The script can also show the data flow of the first request if there are -`TENSORS` traces in the file. If the `TENSORS` traces are from an ensemble, -the data flow will be shown with the dependency of each model. - -``` -... -Data Flow: - ========================================================== - Name: ensemble - Version:1 - QUEUE_INPUT: - input: [[0.705676 0.830855 0.833153]] - BACKEND_OUTPUT: - output: [[1. 2. 7. 0. 4. 7. 9. 3. 4. 9.]] - ========================================================== - ================================================== - Name: test_trt1 - Version:1 - QUEUE_INPUT: - input: [[0.705676 0.830855 0.833153]] - BACKEND_OUTPUT: - output1: [[1. 1. 
...]] - ================================================== - ================================================== - Name: test_trt2 - Version:1 - QUEUE_INPUT: - input: [[0.705676 0.830855 0.833153]] - BACKEND_OUTPUT: - output2: [[2. 2. ...]] - ================================================== - ================================================== - Name: test_py - Version:1 - QUEUE_INPUT: - output1: [[1. 1. ...]] - QUEUE_INPUT: - output2: [[2. 2. ...]] - BACKEND_OUTPUT: - output: [[1. 2. 7. 0. 4. 7. 9. 3. 4. 9.]] - ================================================== -... -``` - -The meaning of the trace timestamps is: - -* GRPC Request Wait/Read: Collected only for inference requests that use the - GRPC protocol. The time spent waiting for a request to arrive at the - server and for that request to be read. Because wait time is - included in the time it is not a useful measure of how much time is - spent reading a request from the network. Tracing an HTTP request - will provide an accurate measure of the read time. - -* HTTP Request Receive: Collected only for inference requests that use the - HTTP protocol. The time required to read the inference request from - the network. - -* Send: The time required to send the inference response. - -* Overhead: Additional time required in the HTTP or GRPC endpoint to - process the inference request and response. - -* Handler: The total time spent handling the inference request, not - including the HTTP and GRPC request/response handling. - - * Queue: The time the inference request spent in the scheduling queue. - - * Compute: The time the inference request spent executing the actual - inference. This time includes the time spent copying input and - output tensors. If --trace-level=TIMESTAMPS then a breakdown of the - compute time will be provided as follows: - - * Input: The time to copy input tensor data as required by the - inference framework / backend. This includes the time to copy - input tensor data to the GPU. - - * Infer: The time spent executing the model to perform the - inference. - - * Output: The time to copy output tensor data as required by the - inference framework / backend. This includes the time to copy - output tensor data from the GPU. - - * Overhead: Additional time required for request handling not - covered by Queue or Compute times. - -* Data Flow: The data flow of the first request. It contains the input and - output tensors of each part of execution. - - * Name: The name of model. - - * Version: The version of model. - - * QUEUE_INPUT: The tensor entering the queue of a backend to wait for - scheduling. - - * BACKEND_OUTPUT: The tensor in the response of a backend. diff --git a/docs/architecture.md b/docs/user_guide/architecture.md similarity index 96% rename from docs/architecture.md rename to docs/user_guide/architecture.md index c58d80f6b6..b343842014 100644 --- a/docs/architecture.md +++ b/docs/user_guide/architecture.md @@ -1,5 +1,5 @@ + +# Debugging Guide +This guide goes over first-step troubleshooting for common scenarios in which Triton is behaving unexpectedly or failing. Below, we break down the issues into these categories: + +- **[Configuration](#configuration-issues)**: Triton reports an error with your configuration file. +- **[Model](#model-issues)**: Your model fails to load or perform inference. +- Server: The server is crashing or unavailable. +- Client: The client is failing in sending and receiving data to the server. +- Performance: Triton is not achieving optimal performance. 
+ +Regardless of the category of your issue, it is worthwhile to try running in the latest Triton container, whenever possible. While we provide support to older containers, fixes get merged into the next release. By checking the latest release, you can spot whether this issue has already been resolved. + +You can also search [Triton’s GitHub issues](https://github.com/triton-inference-server/server/issues) to see if someone previously asked about your issue. If you received an error, you can use a few keywords from the error as a search term. + +Triton provides different types of errors and statuses, relevant across a wide swath of issues. Here is an overview of them: + +| Error | Definition | Example | +| ----- | ---------- | ------- | +|Already Exists | Returned when an action cannot be done because there is already an existing item. | A registered model fails to be registered again.| +| Internal | Returned when there is an unexpected failure within the Triton code. | A memory allocation fails. | +| Invalid Arg | Returned when an invalid argument is provided to a function | A model config has an invalid parameter | +| Not Found | Returned when a requested resource is unable to be found | A shared library is unable to be found | +| Unavailable | Returned when a requested resource is found but unavailable | A requested model is not ready for inference | +| Unknown | Returned for cases where the reason for the error is unknown | This error code should not be used | +| Unsupported | Returned when an option is unsupported | A model config includes a parameter that is not yet supported for that backend | + +## Configuration Issues + +Before proceeding, please see if the model configuration documentation [here](./model_configuration.md) resolves your question. Beyond that, the best places to find a sample model configuration for your use cases are: + +- The server [qa folder](https://github.com/triton-inference-server/server/tree/main/qa). You can find test scripts covering most features, including some which update the model config files to do so. + - [Custom_models](https://github.com/triton-inference-server/server/tree/main/qa/custom_models), [ensemble_models](https://github.com/triton-inference-server/server/tree/main/qa/ensemble_models), and [python_models](https://github.com/triton-inference-server/server/tree/main/qa/python_models) include examples of configs for their respective use cases. + - [L0_model_config](https://github.com/triton-inference-server/server/tree/main/qa/L0_model_config) tests many types of incomplete model configs. + +Note that if you are running into an issue with [perf_analyzer](https://github.com/triton-inference-server/server/blob/main/docs/perf_analyzer.md) or [Model Analyzer](https://github.com/triton-inference-server/model_analyzer), try loading the model onto Triton directly. This checks if the configuration is incorrect or the perf_analyzer or Model Analyzer options need to be updated. + +## Model Issues +**Step 1. Run Models Outside of Triton** + +If you are running into an issue with loading or running a model, the first step is to ensure your model runs in its framework outside of Triton. For example, you can run ONNX models in ONNX Runtime and TensorRT models in trtexec. If this check fails, the issue is happening within the framework and not within Triton. + +**Step 2. Find the Error Message** + +If you receive an error message, you may be able to find where it was generated by searching the code. 
GitHub provides instructions for searching code [here](https://docs.github.com/en/search-github/searching-on-github/searching-code). A generic search through the Triton organization is available at [this link](https://github.com/search?q=org%3Atriton-inference-server&type=Code). + +If your error message only occurs in one or a few places in the Triton code, you may be able to see what’s going wrong pretty quickly. Even if not, it’s good to save this link to provide to us when asking for help with your issue. This is often the first thing we look for. + +**Step 3. Build with Debug Flags** + +The next step is building with debug flags. We unfortunately don’t provide a debug container, so you’d need to follow the [build guide](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md) to build the container, which includes a [section on adding debug symbols](https://github.com/triton-inference-server/server/blob/main/docs/build.md#building-with-debug-symbols). Once you do so, you can install GDB (`apt-get install gdb`) in the container and run Triton in GDB (`gdb --args tritonserver…`). If needed, you can open a second terminal to run a script in another container. If the server segfaults, you can enter `backtrace`, which will provide you a call stack that lets you know where the error got generated. You should then be able to trace the source of the error. If the bug still exists after debugging, we’ll need this to expedite our work. + +Advanced GDB users can also examine variable values, add breakpoints, and more to find the cause of their issue. + +### Specific Issues +**Undefined Symbols** + +There are a few options here: +- This often means a version mismatch between the version of a framework used by Triton and the one used to create the model. Check the version of the framework used in the Triton container and compare against the version used to generate the model. +- If you are loading a shared library used by a backend, don’t forget to include LD_PRELOAD before the command to run Tritonserver.  + - `LD_PRELOAD= tritonserver --model-repository…` +If you built the backend yourself, this could be a linking error. If you are confident the backends and server were built correctly, double check that the server is loading the correct backend. + +## Server Issues + +You generally should not run into errors with the server itself. If the server goes down, it’s usually because something went wrong during model loading or inference and you can use the above section to debug. It’s particularly useful to work through the [Building with Debug Flags](https://github.com/triton-inference-server/server/blob/main/docs/build.md#building-with-debug-symbols) section above to resolve those sorts of issues. However, this section will go through some specific cases that may occur. + +### No Connection to Server + +If you are having trouble connecting to the server or getting its health via the health endpoint (`curl -v localhost:8000/v2/health/ready`), make sure you are able to reach the network your server is running on from where you are running your command. Most commonly, we see that when separate Docker containers are started for the client and server, they are not started with [--net=host](https://docs.docker.com/network/host/) to share the network. + +### Intermittent Failure + +This is going to be one of the hardest things to debug. If possible, you want to build your server with debug flags to get a backtrace of what is happening specifically. 
You would also want to keep notes to see how often this happens and whether that is a common cause. The server itself should not fail while idling, so see if a certain action (loading/unloading a model, running a model inference, etc.) is triggering it. + +### Server Failure Due to Individual Models + +If you want the server to start up even when models fail, use the `exit-on-error=false` option. If you want the server health endpoint to show ready even when specific models fail, use the `--strict-readiness=false` flag. + +### Deadlock + +Some useful steps for debugging a deadlock with `gdb`: +1. Use `$info threads` to see which threads are waiting. +2. Go to a thread: `$thread 4`. +3. Print the backtrace: `$bt`. +4. Go to the frame with the lock: `$f 1`. +5. Print the memory of the mutex being held: `$p *mutex`. +6. You can now see the owner of the mutex under `owner`. + +## Client Issues + +For working with different client cases, the best resources are the [client repo’s](https://github.com/triton-inference-server/client) examples. You can see clients written in Python, Java, and C++ with running examples across many common use cases. You can review the main functions of these clients to get a sense of the flow of the code. + +We often get performance optimization questions around the clients. Triton clients send input tensors as raw binary. However, GRPC uses protobuf which has some serialization and deserialization overhead. For those looking for the lowest-latency solution, C API eliminates the latency associated with GRPC/HTTP. Shared memory is also a good option to reduce data movement when the client and server are on the same system. + +## Performance Issues + +This section goes over debugging unexpected performance. If you are looking to optimize performance, please see the [Optimization](https://github.com/triton-inference-server/server/blob/main/docs/optimization.md) and [Performance Tuning](https://github.com/triton-inference-server/server/blob/main/docs/performance_tuning.md) guides. + +The easiest step to start with is running perf_analyzer to get a breakdown of the request lifecycle, throughput, and latency for each individual model. For a more detailed view, you can [enable tracing](https://github.com/triton-inference-server/server/blob/main/docs/trace.md) when running the server. This will provide exact timestamps to drill down into what is happening. You can also enable tracing with perf_analyzer for the GRPC and HTTP clients by using the tracing flags. Note that enabling tracing can impact Triton’s performance, but it can be helpful to examine the timestamps throughout a request’s lifecycle. + +### Performance Profiling + +The next step would be to use a performance profiler. One profiler we recommend is [Nsight Systems](https://developer.nvidia.com/nsight-systems) (nsys), optionally including NVIDIA Tools Extension (NVTX) markers to profile Triton. + +The Triton server container already has nsys installed. However, Triton does not build with the NVTX markers by default. If you want to use NVTX markers, you should build Triton with build.py, using the “--enable-nvtx” flag. This will provide details around some phases of processing a request, such as queueing, running inference, and handling outputs. + +You can profile Triton by running `nsys profile tritonserver --model-repository …`. The [nsys documentation](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) provides more options and details for getting a thorough overview of what is going on. 
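+
+If you are unsure which nsys options to start with, the following invocation is
+one reasonable sketch (the flag values are illustrative, and `/mnt/models` is a
+placeholder for your model repository path):
+
+```
+# Trace CUDA, NVTX, and OS runtime activity for 60 seconds and write the report to "triton_profile"
+nsys profile -t cuda,nvtx,osrt -d 60 -o triton_profile tritonserver --model-repository=/mnt/models
+```
+
+Remember that Triton's own NVTX ranges will only appear in the report if the server
+was built with the `--enable-nvtx` flag described above.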
+ +## Submitting an Issue + +If you’ve done the initial debugging steps with no results, the next step is to submit the issue to us. Before you do so, please answer these questions: +- Is this reproducible with multiple models and/or our example models? Or is the issue unique to your model? +- Is the bug reproducible with any protocol (ex: HTTP vs GRPC)? Or only one protocol? + +The answers to the above should inform what you submit. If you find that this issue only happens under specific circumstances, please include this in your report. If the issue still exists, please submit **all** of the below: + +- The commands or script used to build/pull Triton and run your models. + - If building Triton, please provide the version or branch you are building from. +- Your model configuration file. +- The error received, plus any logs. + - If your issue involves the server crashing, a backtrace of the dump would be helpful. + - Please enable verbose logging (--verbose-log=1) to get the most detailed logs. +- If this issue is unique to your model, your model or a toy model that reproduces the issue. +- Anything else that would expedite our investigation. diff --git a/docs/decoupled_models.md b/docs/user_guide/decoupled_models.md similarity index 73% rename from docs/decoupled_models.md rename to docs/user_guide/decoupled_models.md index 23ac7febe7..fbe6f4c298 100644 --- a/docs/decoupled_models.md +++ b/docs/user_guide/decoupled_models.md @@ -1,5 +1,5 @@ + +# Metrics + +Triton provides [Prometheus](https://prometheus.io/) metrics +indicating GPU and request statistics. By default, these metrics are +available at http://localhost:8002/metrics. The metrics are only +available by accessing the endpoint, and are not pushed or published +to any remote server. The metric format is plain text so you can view +them directly, for example: + +``` +$ curl localhost:8002/metrics +``` + +The `tritonserver --allow-metrics=false` option can be used to disable +all metric reporting, while the `--allow-gpu-metrics=false` and +`--allow-cpu-metrics=false` can be used to disable just the GPU and CPU +metrics respectively. + +The `--metrics-port` option can be used to select a different port. By default, +Triton reuses the `--http-address` option for the metrics endpoint and binds the +http and metrics endpoints to the same specific address when http service is +enabled. If http service is not enabled, the metric address will bind to `0.0.0.0` +by default. To uniquely specify the metric endpoint, `--metrics-address` option +can be used. See the `tritonserver --help` output for more info on these CLI options. + +To change the interval at which metrics are polled/updated, see the `--metrics-interval-ms` flag. Metrics that are updated "Per Request" are unaffected by this interval setting. This interval only applies to metrics that are designated as "Per Interval" in the tables of each section below: + +- [Inference Request Metrics](#inference-request-metrics) +- [GPU Metrics](#gpu-metrics) +- [CPU Metrics](#cpu-metrics) +- [Response Cache Metrics](#response-cache-metrics) +- [Custom Metrics](#custom-metrics) + +## Inference Request Metrics + +### Counts + +For models that do not support batching, *Request Count*, *Inference +Count* and *Execution Count* will be equal, indicating that each +inference request is executed separately. + +For models that support batching, the count metrics can be interpreted +to determine average batch size as *Inference Count* / *Execution +Count*. 
The count metrics are illustrated by the following examples: + +* Client sends a single batch-1 inference request. *Request Count* = + 1, *Inference Count* = 1, *Execution Count* = 1. + +* Client sends a single batch-8 inference request. *Request Count* = + 1, *Inference Count* = 8, *Execution Count* = 1. + +* Client sends 2 requests: batch-1 and batch-8. Dynamic batcher is not + enabled for the model. *Request Count* = 2, *Inference Count* = 9, + *Execution Count* = 2. + +* Client sends 2 requests: batch-1 and batch-1. Dynamic batcher is + enabled for the model and the 2 requests are dynamically batched by + the server. *Request Count* = 2, *Inference Count* = 2, *Execution + Count* = 1. + +* Client sends 2 requests: batch-1 and batch-8. Dynamic batcher is + enabled for the model and the 2 requests are dynamically batched by + the server. *Request Count* = 2, *Inference Count* = 9, *Execution + Count* = 1. + +|Category |Metric |Metric Name |Description |Granularity|Frequency | +|--------------|----------------|------------|---------------------------|-----------|-------------| +|Count |Success Count |`nv_inference_request_success` |Number of successful inference requests received by Triton (each request is counted as 1, even if the request contains a batch) |Per model |Per request | +| |Failure Count |`nv_inference_request_failure` |Number of failed inference requests received by Triton (each request is counted as 1, even if the request contains a batch) |Per model |Per request | +| |Inference Count |`nv_inference_count` |Number of inferences performed (a batch of "n" is counted as "n" inferences, does not include cached requests)|Per model|Per request| +| |Execution Count |`nv_inference_exec_count` |Number of inference batch executions (see [Inference Request Metrics](#inference-request-metrics), does not include cached requests)|Per model|Per request| +| |Pending Request Count |`nv_inference_pending_request_count` |Number of inference requests awaiting execution by a backend. This number is incremented when a request is enqueued to the server (`TRITONSERVER_ServerInferAsync`) and is decremented when a backend is about to start executing the request. More details can be found below. |Per model|Per request| + +#### Pending Request Count (Queue Size) Per-Model + +The *Pending Request Count* reflects the number of requests that have been +received by Triton core via `TRITONSERVER_InferAsync`, but have not yet +started execution by a backend model instance +(`TRITONBACKEND_ModelInstanceExecute`). + +For all intents and purposes, the +"pending request count" and "queue size" per-model can be used +interchangeably, and the number reflected in the metric should +intuitively represent the number of requests that are not currently +being executed by any model instances. In simple terms, if you send a 100 +requests to a model that can only handle 5 requests concurrently, then you +should see a pending count of 95 for that model in most cases. + +For those interested in more technical details, the term "pending request count" +is a bit more accurate than "queue size" because Triton is highly configurable, +and there are many places in Triton that a request be considered pending rather +than a single queue. Some of the most common will be called out below: +- Default Scheduler backlogs any requests not currently executing. + - Assuming 1 available model instance with the default scheduler settings, + and 10 requests are sent in rapid succession. 
+ - The 1st request should be picked up for + execution immediately, and the remaining 9 requests should be considered + pending for this model, until the 1st request is finished. Afterwards, the + next request should be picked up and the pending count should be decremented + to 8, and so on until all requests are finished and the pending count is 0. +- Dynamic Batcher queue for dynamically creating batches from requests. + - Assuming 1 available model instance with the dynamic batch scheduler + configured with `max_batch_size: 4` and a sufficiently large + `max_queue_delay_microseconds` (or queue of requests), + and 10 requests are sent in rapid succession. + - The first 4 requests, or as large of a batch the scheduler could form, + should be picked up for execution immediately, and the remaining 6 requests + should be considered pending. After the batch finishes, the next batch + should be picked up, decrementing the pending count again to 2 pending. + Then finally since only 2 requests remain, the final 2 requests will be + batched and picked up by the backend, decrementing the pending count to 0. +- Sequence Batcher queues and backlogs for ongoing sequence requests, some may + be assigned sequence slots, some may not. + - Sequence Batchers of both strategies (direct and oldest) will have pending + counts that generally follow the same trend as the dynamic batching + description above. The sequence batchers will immediately execute as many + requests in a batch as it can based on the model/scheduler config settings, + and any further requests will be considered pending until the previous batch + finishes and the next batch can start. +- Rate Limiter queues for prepared batches of requests. + - When rate limiting is enabled, requests can be held back from execution + to satisfy the rate limit constraints that were configured. + +There are some places where a request would not be considered pending: +- Ensemble Scheduler + - The Ensemble Scheduler almost immediately enqueues any requests it receives + into the composing model schedulers at the first step in the ensemble. + Therefore, the requests could be considered pending by the composing model + scheduler's, however from the ensemble's perspective, these requests have been + scheduled. +- Frontends (HTTP/GRPC Servers) + - Any requests sent from a client to a frontend server in-front of Triton + may spend some time in the corresponding server's code mapping + protocol-specific metadata to Triton metadata. Though this time is + generally brief, it will not be considered pending from Triton's + perspective until Triton core has received the request from the frontend. + +### Latencies + +Starting in 23.04, Triton exposes the ability to choose the types of metrics +that are published through the `--metrics-config` CLI options. 
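+
+For example, the latency metric families described in the subsections below can
+be toggled together at startup. This is only a sketch; it uses the
+`counter_latencies` and `summary_latencies` keys covered in this section, and
+the model repository path is a placeholder:
+
+```
+# Keep the default latency counters and additionally enable summary metrics
+tritonserver --model-repository=/models \
+             --metrics-config counter_latencies=true \
+             --metrics-config summary_latencies=true
+```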
+ +#### Counters + +By default, the following +[Counter](https://prometheus.io/docs/concepts/metric_types/#counter) +metrics are used for latencies: + +|Category |Metric |Metric Name |Description |Granularity|Frequency | +|--------------|----------------|------------|---------------------------|-----------|-------------| +|Latency |Request Time |`nv_inference_request_duration_us` |Cumulative end-to-end inference request handling time (includes cached requests) |Per model |Per request | +| |Queue Time |`nv_inference_queue_duration_us` |Cumulative time requests spend waiting in the scheduling queue (includes cached requests) |Per model |Per request | +| |Compute Input Time|`nv_inference_compute_input_duration_us` |Cumulative time requests spend processing inference inputs (in the framework backend, does not include cached requests) |Per model |Per request | +| |Compute Time |`nv_inference_compute_infer_duration_us` |Cumulative time requests spend executing the inference model (in the framework backend, does not include cached requests) |Per model |Per request | +| |Compute Output Time|`nv_inference_compute_output_duration_us` |Cumulative time requests spend processing inference outputs (in the framework backend, does not include cached requests) |Per model |Per request | + +To disable these metrics specifically, you can set `--metrics-config counter_latencies=false` + +#### Summaries + +> **Note** +> +> The following Summary feature is experimental for the time being and may be +> subject to change based on user feedback. + +To get configurable quantiles over a sliding time window, Triton supports +a set a [Summary](https://prometheus.io/docs/concepts/metric_types/#summary) +metrics for latencies as well. These metrics are disabled by default, but can +be enabled by setting `--metrics-config summary_latencies=true`. + +For more information on how the quantiles are calculated, see +[this explanation](https://grafana.com/blog/2022/03/01/how-summary-metrics-work-in-prometheus/). + +The following summary metrics are available: + +|Category |Metric |Metric Name |Description |Granularity|Frequency | +|--------------|----------------|------------|---------------------------|-----------|-------------| +|Latency |Request Time |`nv_inference_request_summary_us` |Summary of end-to-end inference request handling times (includes cached requests) |Per model |Per request | +| |Queue Time |`nv_inference_queue_summary_us` |Summary of time requests spend waiting in the scheduling queue (includes cached requests) |Per model |Per request | +| |Compute Input Time|`nv_inference_compute_input_summary_us` |Summary time requests spend processing inference inputs (in the framework backend, does not include cached requests) |Per model |Per request | +| |Compute Time |`nv_inference_compute_infer_summary_us` |Summary of time requests spend executing the inference model (in the framework backend, does not include cached requests) |Per model |Per request | +| |Compute Output Time|`nv_inference_compute_output_summary_us` |Summary of time requests spend processing inference outputs (in the framework backend, does not include cached requests) |Per model |Per request | + +Each summary above is actually composed of several sub-metrics. For each +metric, there is a set of `quantile` metrics tracking the latency for each +quantile. Additionally, there are `_count` and `_sum` metrics that aggregate +the count and observed values for each. 
For example, see the following +information exposed by the Inference Queue Summary metrics: +``` +# HELP nv_inference_queue_summary_us Summary of inference queuing duration in microseconds (includes cached requests) +# TYPE nv_inference_queue_summary_us summary +nv_inference_queue_summary_us_count{model="my_model",version="1"} 161 +nv_inference_queue_summary_us_sum{model="my_model",version="1"} 11110 +nv_inference_queue_summary_us{model="my_model",version="1",quantile="0.5"} 55 +nv_inference_queue_summary_us{model="my_model",version="1",quantile="0.9"} 97 +nv_inference_queue_summary_us{model="my_model",version="1",quantile="0.95"} 98 +nv_inference_queue_summary_us{model="my_model",version="1",quantile="0.99"} 101 +nv_inference_queue_summary_us{model="my_model",version="1",quantile="0.999"} 101 +``` + +The count and sum for the summary above show that stats have been recorded for +161 requests, and took a combined total of 11110 microseconds. The `_count` and +`_sum` of a summary should generally match the counter metric equivalents when +applicable, such as: +``` +nv_inference_request_success{model="my_model",version="1"} 161 +nv_inference_queue_duration_us{model="my_model",version="1"} 11110 +``` + +Triton has a set of default quantiles to track, as shown above. To set +custom quantiles, you can use the `--metrics-config` CLI option. The format is: +``` +tritonserver --metrics-config summary_quantiles=":,...,:"` +``` + +For example: +``` +tritonserver --metrics-config summary_quantiles="0.5:0.05,0.9:0.01,0.95:0.001,0.99:0.001"` +``` + +To better understand the setting of error values for computing each quantile, see the +[best practices for histograms and summaries](https://prometheus.io/docs/practices/histograms/#histograms-and-summaries). + + +## GPU Metrics + +GPU metrics are collected through the use of [DCGM](https://developer.nvidia.com/dcgm). +Collection of GPU metrics can be toggled with the `--allow-gpu-metrics` CLI flag. +If building Triton locally, the `TRITON_ENABLE_METRICS_GPU` CMake build flag can be used to toggle building the relevant code entirely. + +|Category |Metric |Metric Name |Description |Granularity|Frequency | +|----------------|------------------|----------------------------|-------------------------------------------------------|-----------|-------------| +|GPU Utilization |Power Usage |`nv_gpu_power_usage` |GPU instantaneous power, in watts |Per GPU |Per interval | +| |Power Limit |`nv_gpu_power_limit` |Maximum GPU power limit, in watts |Per GPU |Per interval | +| |Energy Consumption|`nv_energy_consumption` |GPU energy consumption since Triton started, in joules |Per GPU |Per interval | +| |GPU Utilization |`nv_gpu_utilization` |GPU utilization rate (0.0 - 1.0) |Per GPU |Per interval | +|GPU Memory |GPU Total Memory |`nv_gpu_memory_total_bytes` |Total GPU memory, in bytes |Per GPU |Per interval | +| |GPU Used Memory |`nv_gpu_memory_used_bytes` |Used GPU memory, in bytes |Per GPU |Per interval | + + +## CPU Metrics + +Collection of CPU metrics can be toggled with the `--allow-cpu-metrics` CLI flag. +If building Triton locally, the `TRITON_ENABLE_METRICS_CPU` CMake build flag can be used to toggle building the relevant code entirely. + +> **Note** +> +> CPU Metrics are currently only supported on Linux. +> They collect information from the [/proc filesystem](https://www.kernel.org/doc/html/latest/filesystems/proc.html) such as `/proc/stat` and `/proc/meminfo`. 
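+
+A quick way to confirm that CPU metrics are being collected is to filter the
+metrics endpoint for the `nv_cpu` families listed in the table below (a minimal
+check, assuming the default metrics port of 8002):
+
+```
+curl -s localhost:8002/metrics | grep "^nv_cpu"
+```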
+ +|Category |Metric |Metric Name |Description |Granularity|Frequency | +|--------------|----------------|------------|---------------------------|-----------|-------------| +|CPU Utilization | CPU Utilization | `nv_cpu_utilization` | Total CPU utilization rate [0.0 - 1.0] | Aggregated across all cores since last interval | Per interval | +|CPU Memory | CPU Total Memory | `nv_cpu_memory_total_bytes` | Total CPU memory (RAM), in bytes | System-wide | Per interval | +| | CPU Used Memory | `nv_cpu_memory_used_bytes` | Used CPU memory (RAM), in bytes | System-wide | Per interval | + +## Response Cache Metrics + +Cache metrics can be reported in two ways: + +1. A base set of cache metrics will be reported +by Triton directly, such as the cache hit/miss counts and durations described +below. + +2. As of 23.03, additional cache metrics may be reported depending on the +[cache implementation](response_cache.md#cache-implementations) +being used through Triton's [Metrics API](#custom-metrics). + +### Triton-reported Response Cache Metrics + +Compute latency metrics in the +[Inference Request Metrics table](#inference-request-metrics) above are +calculated for the time spent in model inference backends. If the response +cache is enabled for a given model (see [Response Cache](response_cache.md) +docs for more info), total inference times may be affected by response cache +lookup times. + +On cache hits, "Cache Hit Time" indicates the time spent looking up the +response, and "Compute Input Time" / "Compute Time" / "Compute Output Time" +are not recorded. + +On cache misses, "Cache Miss Time" indicates the time spent looking up +the request hash and inserting the computed output tensor data into the cache. +Otherwise, "Compute Input Time" / "Compute Time" / "Compute Output Time" will +be recorded as usual. + +|Category |Metric |Metric Name |Description |Granularity|Frequency | +|--------------|----------------|------------|---------------------------|-----------|-------------| +|Count |Cache Hit Count |`nv_cache_num_hits_per_model` |Number of response cache hits per model |Per model |Per request | +| |Cache Miss Count |`nv_cache_num_misses_per_model` |Number of response cache misses per model |Per model |Per request | +|Latency |Cache Hit Time |`nv_cache_hit_duration_per_model` |Cumulative time requests spend retrieving a cached response per model on cache hits (microseconds) |Per model |Per request | +| |Cache Miss Time |`nv_cache_miss_duration_per_model` |Cumulative time requests spend looking up and inserting responses into the cache on a cache miss (microseconds) |Per model |Per request | + +Similar to the Summaries section above for Inference Request Metrics, the +per-model cache hit/miss latency metrics also support Summaries. + +> **Note** +> +> For models with response caching enabled, the inference request **summary** metric +> is currently disabled. This is due to extra time spent internally on cache +> management that wouldn't be reflected correctly in the end to end request time. +> Other summary metrics are unaffected. + +## Custom Metrics + +Triton exposes a C API to allow users and backends to register and collect +custom metrics with the existing Triton metrics endpoint. The user takes the +ownership of the custom metrics created through the APIs and must manage their +lifetime following the API documentation. 
+ +The +[identity_backend](https://github.com/triton-inference-server/identity_backend/blob/main/README.md#custom-metric-example) +demonstrates a practical example of adding a custom metric to a backend. + +Further documentation can be found in the `TRITONSERVER_MetricFamily*` and +`TRITONSERVER_Metric*` API annotations in +[tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). diff --git a/docs/user_guide/model_analyzer.md b/docs/user_guide/model_analyzer.md new file mode 100644 index 0000000000..663a8a277a --- /dev/null +++ b/docs/user_guide/model_analyzer.md @@ -0,0 +1,45 @@ + + +# Model Analyzer + +The Triton [Model Analyzer](https://github.com/triton-inference-server/model_analyzer) + is a tool that uses +[Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md) +to send requests to your model while measuring GPU memory and compute +utilization. The Model Analyzer is specifically useful for characterizing the +GPU memory requirements for your model under different batching and model +instance configurations. Once you have this GPU memory usage information you can +more intelligently decide on how to combine multiple models on the same GPU +while remaining within the memory capacity of the GPU. + +For more detailed examples and explanations of using Model Analyzer, see: +- [Model Analyzer Conceptual Guide](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_3-optimizing_triton_configuration) +- [Maximizing Deep Learning +Inference Performance with NVIDIA Model +Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer) \ No newline at end of file diff --git a/docs/model_configuration.md b/docs/user_guide/model_configuration.md similarity index 76% rename from docs/model_configuration.md rename to docs/user_guide/model_configuration.md index e1062186d2..241301ade7 100644 --- a/docs/model_configuration.md +++ b/docs/user_guide/model_configuration.md @@ -1,5 +1,5 @@ + +Perf Analyzer documentation has been relocated to +[here](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md). diff --git a/docs/user_guide/performance_tuning.md b/docs/user_guide/performance_tuning.md new file mode 100644 index 0000000000..e28789a2d3 --- /dev/null +++ b/docs/user_guide/performance_tuning.md @@ -0,0 +1,393 @@ + + +# Deploying your trained model using Triton + +Given a trained model, how do I deploy it at-scale with an optimal configuration +using Triton Inference Server? This document is here to help answer that. + +For those who like a [high level overview](#overview), below is the common flow +for most use cases. + +For those who wish to jump right in, skip to the +[end-to-end example](#end-to-end-example). + +For additional material, see the +[Triton Conceptual Guide tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_4-inference_acceleration). + +## Overview + +1. Is my model compatible with Triton? + - If your model falls under one of Triton's + [supported backends](https://github.com/triton-inference-server/backend), + then we can simply try to deploy the model as described in the + [Quickstart](../getting_started/quickstart.md) guide. 
+ For the ONNXRuntime, TensorFlow SavedModel, and TensorRT backends, the + minimal model configuration can be inferred from the model using Triton's + [AutoComplete](model_configuration.md#auto-generated-model-configuration) + feature. + This means that a `config.pbtxt` may still be provided, but is not required + unless you want to explicitly set certain parameters. + Additionally, by enabling verbose logging via `--log-verbose=1`, you can see + the complete config that Triton sees internally in the server log output. + For other backends, refer to the + [Minimal Model Configuration](model_configuration.md#minimal-model-configuration) + required to get started. + - If your model does not come from a supported backend, you can look into + the [Python Backend](https://github.com/triton-inference-server/python_backend) + or writing a + [Custom C++ Backend](https://github.com/triton-inference-server/backend/blob/main/examples/README.md) + to support your model. The Python Backend provides a simple interface to + execute requests through a generic python script, but may not be as + performant as a Custom C++ Backend. Depending on your use case, the Python + Backend performance may be a sufficient tradeoff for the simplicity of + implementation. + +2. Can I run inference on my served model? + - Assuming you were able to load your model on Triton, the next step is to + verify that we can run inference requests and get a baseline performance + benchmark of your model. + Triton's + [Perf Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md) + tool specifically fits this purpose. Here is a simplified output for + demonstration purposes: + + ``` + # NOTE: "my_model" represents a model currently being served by Triton + $ perf_analyzer -m my_model + ... + + Inferences/Second vs. Client Average Batch Latency + Concurrency: 1, throughput: 482.8 infer/sec, latency 12613 usec + ``` + + - This gives us a sanity test that we are able to successfully form input + requests and receive output responses to communicate with the model backend + via Triton APIs. + - If Perf Analyzer fails to send requests and it is unclear from the error + how to proceed, then you may want to sanity check that your model + `config.pbtxt` inputs/outputs match what the model expects. If the config + is correct, check that the model runs successfully using its original + framework directly. If you don't have your own script or tool to do so, + [Polygraphy](https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy) + is a useful tool to run sample inferences on your model via various + frameworks. Currently, Polygraphy supports ONNXRuntime, TensorRT, and + TensorFlow 1.x. + - The definition of "performing well" is subject to change for each use + case. Some common metrics are throughput, latency, and GPU utilization. + There are many variables that can be tweaked just within your model + configuration (`config.pbtxt`) to obtain different results. + - As your model, config, or use case evolves, + [Perf Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md) + is a great tool to quickly verify model functionality and performance. + +3. How can I improve my model performance? + - To further understand the best model configuration you can provide to + Triton for your use case, Triton's + [Model Analyzer](https://github.com/triton-inference-server/model_analyzer) + tool can help. 
+ Model Analyzer can automatically or + [manually](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config_search.md) + search through config combinations to find the optimal triton configuration + to meet your constraints. After running Model Analyzer to find the optimal + configurations for your model/use case, you can transfer the generated + config files to your [Model Repository](model_repository.md). + Model Analyzer provides a + [Quickstart](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/quick_start.md) + guide with some examples to walk through. + - Upon serving the model with the newly optimized configuration file found + by Model Analyzer and running Perf Analyzer again, you should expect to find + better performance numbers in most cases compared to a default config. + - Some parameters that can be tuned for a model may not be exposed to Model + Analyzer's automatic search since they don't apply to all models. + For instance, [backends](https://github.com/triton-inference-server/backend) + can expose backend-specific configuration options that can be tuned as well. + The [ONNXRuntime + Backend](https://github.com/triton-inference-server/onnxruntime_backend), + for example, has several + [parameters](https://github.com/triton-inference-server/onnxruntime_backend#model-config-options) + that affect the level of parallelization when executing inference on a + model. + These backend-specific options may be worth investigating if the defaults + are not providing sufficient performance. To tune custom sets of + parameters, Model Analyzer supports + [Manual Configuration Search](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config_search.md). + - To learn more about further optimizations for your model configuration, + see the [Optimization](optimization.md) docs. + +### Other Areas of Interest + +1. My model performs slowly when it is first loaded by Triton +(cold-start penalty), what do I do? + - Triton exposes the ability to run + [ModelWarmup](model_configuration.md#model-warmup) requests when first + loading the model to ensure that the model is sufficiently warmed up before + being marked "READY" for inference. + +2. Why doesn't my model perform significantly faster on GPU? + - Most official backends supported by Triton are optimized for GPU inference + and should perform well on GPU out of the box. + - Triton exposes options for you to optimize your model further on the GPU. + Triton's + [Framework Specific Optimizations](optimization.md#framework-specific-optimization) + goes into further detail on this topic. + - Complete conversion of your model to a backend fully optimized for GPU + inference such as [TensorRT](https://developer.nvidia.com/tensorrt) may + provide even better results. + You may find more Triton-specific details about TensorRT in the + [TensorRT Backend](https://github.com/triton-inference-server/tensorrt_backend). + - If none of the above can help get sufficient GPU-accelerated performance + for your model, the model may simply be better designed for CPU execution + and the [OpenVINO Backend](https://github.com/triton-inference-server/openvino_backend) may + help further optimize your CPU execution. + +## End-to-end Example + +> **Note** +> If you have never worked with Triton before, you may be interested in first +checking out the [Quickstart](../getting_started/quickstart.md) example. 
+> Some basic understanding of Triton may be useful for the following section, +but this example is meant to be straightforward enough without prior experience. + +Let's take an ONNX model as our example since ONNX is designed to be a format +that can be [easily +exported](https://github.com/onnx/tutorials#converting-to-onnx-format) from most +other frameworks. + +1. Create a [Model Repository](model_repository.md) and download our example +`densenet_onnx` model into it. + +```bash +# Create model repository with placeholder for model and version 1 +mkdir -p ./models/densenet_onnx/1 + +# Download model and place it in model repository +wget -O models/densenet_onnx/1/model.onnx +https://contentmamluswest001.blob.core.windows.net/content/14b2744cf8d6418c87ffddc3f3127242/9502630827244d60a1214f250e3bbca7/08aed7327d694b8dbaee2c97b8d0fcba/densenet121-1.2.onnx +``` + +2. Create a minimal [Model Configuration](model_configuration.md) for the +`densenet_onnx` model in our [Model Repository](model_repository.md) at +`./models/densenet_onnx/config.pbtxt`. + +> **Note** +> This is a slightly simplified version of another [example +config](../examples/model_repository/densenet_onnx/config.pbtxt) that utilizes +other [Model Configuration](model_configuration.md) features not necessary for +this example. + +```protobuf +name: "densenet_onnx" +backend: "onnxruntime" +max_batch_size: 0 +input: [ + { + name: "data_0", + data_type: TYPE_FP32, + dims: [ 1, 3, 224, 224] + } +] +output: [ + { + name: "prob_1", + data_type: TYPE_FP32, + dims: [ 1, 1000, 1, 1 ] + } +] +``` + +> **Note** +> As of the 22.07 release, both Triton and Model Analyzer support fully +auto-completing the config file for +[backends that support it](model_configuration.md#auto-generated-model-configuration). +> So for an ONNX model, for example, this step can be skipped unless you want to +explicitly set certain parameters. + +3. Start the server container + +To serve our model, we will use the server container which comes pre-installed +with a `tritonserver` binary. + +```bash +# Start server container +docker run -ti --rm --gpus=all --network=host -v $PWD:/mnt --name triton-server nvcr.io/nvidia/tritonserver:23.11-py3 + +# Start serving your models +tritonserver --model-repository=/mnt/models +``` + +> **Note** +> The `-v $PWD:/mnt` is mounting your current directory on the host into the +`/mnt` directory inside the container. +> So if you created your model repository in `$PWD/models`, you will find it +inside the container at `/mnt/models`. +> You can change these paths as needed. See +[docker volume](https://docs.docker.com/storage/volumes/) docs for more information on +how this works. + + +To check if the model loaded successfully, we expect to see our model in a +`READY` state in the output of the previous command: + +``` +... +I0802 18:11:47.100537 135 model_repository_manager.cc:1345] successfully loaded 'densenet_onnx' version 1 +... ++---------------+---------+--------+ +| Model | Version | Status | ++---------------+---------+--------+ +| densenet_onnx | 1 | READY | ++---------------+---------+--------+ +... +``` + +4. Verify the model can run inference + +To verify our model can perform inference, we will use the `triton-client` +container that we already started which comes with `perf_analyzer` +pre-installed. + +In a separate shell, we use Perf Analyzer to sanity check that we can run +inference and get a baseline for the kind of performance we expect from this +model. 
+ +In the example below, Perf Analyzer is sending requests to models served on the +same machine (`localhost` from the server container via `--network=host`). +However, you may also test models being served remotely at some `:` +by setting the `-u` flag, such as `perf_analyzer -m densenet_onnx -u +127.0.0.1:8000`. + +```bash +# Start the SDK container interactively +docker run -ti --rm --gpus=all --network=host -v $PWD:/mnt --name triton-client nvcr.io/nvidia/tritonserver:23.11-py3-sdk + +# Benchmark model being served from step 3 +perf_analyzer -m densenet_onnx --concurrency-range 1:4 +``` + +``` +... +Inferences/Second vs. Client Average Batch Latency +Concurrency: 1, throughput: 265.147 infer/sec, latency 3769 usec +Concurrency: 2, throughput: 890.793 infer/sec, latency 2243 usec +Concurrency: 3, throughput: 937.036 infer/sec, latency 3199 usec +Concurrency: 4, throughput: 965.21 infer/sec, latency 4142 usec +``` + +5. Run Model Analyzer to find the best configurations for our model + +While Model Analyzer comes pre-installed in the SDK (client) container and +supports various modes of connecting to a Triton server, for simplicity we will +use install Model Analyzer in our `server` container to use the `local` +(default) mode. +To learn more about other methods of connecting Model Analyzer to a running +Triton Server, see the `--triton-launch-mode` Model Analyzer flag. + +```bash +# Enter server container interactively +docker exec -ti triton-server bash + +# Stop existing tritonserver process if still running +# because model-analyzer will start its own server +SERVER_PID=`ps | grep tritonserver | awk '{ printf $1 }'` +kill ${SERVER_PID} + +# Install model analyzer +pip install --upgrade pip +pip install triton-model-analyzer wkhtmltopdf + +# Profile the model using local (default) mode +# NOTE: This may take some time, in this example it took ~10 minutes +model-analyzer profile \ + --model-repository=/mnt/models \ + --profile-models=densenet_onnx \ + --output-model-repository-path=results + +# Summarize the profiling results +model-analyzer analyze --analysis-models=densenet_onnx +``` + +Example Model Analyzer output summary: + +> In 51 measurements across 6 configurations, `densenet_onnx_config_3` provides +the best throughput: **323 infer/sec**. +> +> **This is a 92% gain over the default configuration (168 infer/sec), under the +given constraints.** + +| Model Config Name | Max Batch Size | Dynamic Batching | Instance Count | p99 Latency (ms) | Throughput (infer/sec) | Max GPU Memory Usage (MB) | Average GPU Utilization (%) | +|---|---|---|---|---|---|---|---| +| densenet_onnx_config_3 | 0 | Enabled | 4/GPU | 35.8 | 323.13 | 3695 | 58.6 | +| densenet_onnx_config_2 | 0 | Enabled | 3/GPU | 59.575 | 295.82 | 3615 | 58.9 | +| densenet_onnx_config_4 | 0 | Enabled | 5/GPU | 69.939 | 291.468 | 3966 | 58.2 | +| densenet_onnx_config_default | 0 | Disabled | 1/GPU | 12.658 | 167.549 | 3116 | 51.3 | + +In the table above, we see that setting our GPU [Instance +Count](model_configuration.md#instance-groups) to 4 allows us to achieve the +highest throughput and almost lowest latency on this system. + +Also, note that this `densenet_onnx` model has a fixed batch-size that is +explicitly specified in the first dimension of the Input/Output `dims`, +therefore the `max_batch_size` parameter is set to 0 as described +[here](model_configuration.md#maximum-batch-size). +For models that support dynamic batch size, Model Analyzer would also tune the +`max_batch_size` parameter. 
+ +> **Warning** +> These results are specific to the system running the Triton server, so for +example, on a smaller GPU we may not see improvement from increasing the GPU +instance count. +> In general, running the same configuration on systems with different hardware +(CPU, GPU, RAM, etc.) may provide different results, so it is important to +profile your model on a system that accurately reflects where you will deploy +your models for your use case. + +6. Extract optimal config from Model Analyzer results + +In our example above, `densenet_onnx_config_3` was the optimal configuration. +So let's extract that `config.pbtxt` and put it back in our model repository for future use. + +```bash +# (optional) Backup our original config.pbtxt (if any) to another directory +cp /mnt/models/densenet_onnx/config.pbtxt /tmp/original_config.pbtxt + +# Copy over the optimal config.pbtxt from Model Analyzer results to our model repository +cp ./results/densenet_onnx_config_3/config.pbtxt /mnt/models/densenet_onnx/ +``` + +Now that we have an optimized Model Configuration, we are ready to take our +model to deployment. For further manual tuning, read the [Model +Configuration](model_configuration.md) and [Optimization](optimization.md) docs +to learn more about Triton's complete set of capabilities. + +In this example, we happened to get both the highest throughput and almost +lowest latency from the same configuration, but in some cases this is a tradeoff +that must be made. Certain models or configurations may achieve a higher +throughput but also incur a higher latency in return. It is worthwhile to fully +inspect the reports generated by Model Analyzer to ensure your model performance +meets your requirements. diff --git a/docs/ragged_batching.md b/docs/user_guide/ragged_batching.md similarity index 97% rename from docs/ragged_batching.md rename to docs/user_guide/ragged_batching.md index 3e69beb912..308b75fa57 100644 --- a/docs/ragged_batching.md +++ b/docs/user_guide/ragged_batching.md @@ -57,12 +57,13 @@ How ragged input are processed in a batch of requests depends on the backend implementation. The backends, such as [ONNX Runtime backend](https://github.com/triton-inference-server/onnxruntime_backend), [TensorFlow backend](https://github.com/triton-inference-server/tensorflow_backend), +[PyTorch backend](https://github.com/triton-inference-server/pytorch_backend), and [TensorRT backend](https://github.com/triton-inference-server/tensorrt_backend), require models to accept ragged inputs as 1-dimensional tensors. These backends concatenates the request inputs into the 1-dimensional tensor. Because the concatenated input doesn't track the start and end index for each -request, the backends also require the model to have additional input(s), +request, the backends often require the model to have additional input(s), [batch input](#batch-input), that describe various information about the batch formed. diff --git a/docs/rate_limiter.md b/docs/user_guide/rate_limiter.md similarity index 98% rename from docs/rate_limiter.md rename to docs/user_guide/rate_limiter.md index 2e38327042..69b94fd8b8 100644 --- a/docs/rate_limiter.md +++ b/docs/user_guide/rate_limiter.md @@ -42,9 +42,9 @@ frameworks dynamically allocate memory. Running all such models simultaneously may lead to system going out-of-memory. Rate limiter allows to postpone the inference execution on some -model instances such that not all of them runs simultaneously. +model instances such that not all of them runs simultaneously. 
The model priorities are used to decide which model instance -to schedule next. +to schedule next. ## Using Rate Limiter diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md new file mode 100644 index 0000000000..8db4e3b8c1 --- /dev/null +++ b/docs/user_guide/request_cancellation.md @@ -0,0 +1,102 @@ + + +# Request Cancellation + +Starting from r23.10, Triton supports handling request cancellation received +from the gRPC client or a C API user. Long running inference requests such +as for auto generative large language models may run for an indeterminate +amount of time or indeterminate number of steps. Additionally clients may +enqueue a large number of requests as part of a sequence or request stream +and later determine the results are no longer needed. Continuing to process +requests whose results are no longer required can significantly impact server +resources. + +## Issuing Request Cancellation + +### In-Process C API + +[In-Process Triton Server C API](../customization_guide/inference_protocols.md#in-process-triton-server-api) has been enhanced with `TRITONSERVER_InferenceRequestCancel` +and `TRITONSERVER_InferenceRequestIsCancelled` to issue cancellation and query +whether cancellation has been issued on an inflight request respectively. Read more +about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). + + +### gRPC Endpoint + +In addition, [gRPC endpoint](../customization_guide/inference_protocols.md#httprest-and-grpc-protocols) can +now detect cancellation from the client and attempt to terminate request. +At present, only gRPC python client supports issuing request cancellation +to the server endpoint. See [request-cancellation](https://github.com/triton-inference-server/client#request-cancellation) +for more details on how to issue requests from the client-side. +See gRPC guide on RPC [cancellation](https://grpc.io/docs/guides/cancellation/) for +finer details. + +## Handling in Triton Core + +Triton core checks for requests that have been cancelled at some critical points +when using [dynamic](./model_configuration.md#dynamic-batcher) or +[sequence](./model_configuration.md#sequence-batcher) batching. The checking is +also performed between each +[ensemble](./model_configuration.md#ensemble-scheduler) steps and terminates +further processing if the request is cancelled. + +On detecting a cancelled request, Triton core responds with CANCELLED status. If a request +is cancelled when using [sequence_batching](./model_configuration.md#sequence-batcher), +then all the pending requests in the same sequence will also be cancelled. The sequence +is represented by the requests that has identical sequence id. + +**Note**: Currently, Triton core does not detect cancellation status of a request once +it is forwarded to [rate limiter](./rate_limiter.md). Improving the request cancellation +detection and handling within Triton core is work in progress. + +## Handling in Backend + +Upon receiving request cancellation, Triton does its best to terminate request +at various points. However, once a request has been given to the backend +for execution, it is up to the individual backends to detect and handle +request termination. 
+Currently, the following backends support early termination: +- [TensorRT-LLM backend](https://github.com/triton-inference-server/tensorrtllm_backend) +- [vLLM backend](https://github.com/triton-inference-server/vllm_backend) +- [python backend](https://github.com/triton-inference-server/python_backend) + +Python backend is a special case where we expose the APIs to detect cancellation +status of the request but it is up to the `model.py` developer to detect whether +the request is cancelled and terminate further execution. + +**For the backend developer**: The backend APIs have also been enhanced to let the +backend detect whether the request received from Triton core has been cancelled. +See `TRITONBACKEND_RequestIsCancelled` and `TRITONBACKEND_ResponseFactoryIsCancelled` +in [tritonbackend.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h) +for more details. The backend upon detecting request cancellation can stop processing +it any further. +The Python models running behind Python backend can also query the cancellation status +of request and response_sender. See [this](https://github.com/triton-inference-server/python_backend#request-cancellation-handling) +section in python backend documentation for more details. + diff --git a/docs/user_guide/response_cache.md b/docs/user_guide/response_cache.md new file mode 100644 index 0000000000..e70085e798 --- /dev/null +++ b/docs/user_guide/response_cache.md @@ -0,0 +1,243 @@ + + +# Triton Response Cache + +## Overview + +In this document an *inference request* is the model name, model version, and +input tensors (name, shape, datatype and tensor data) that make up a request +submitted to Triton. An inference result is the output tensors (name, shape, +datatype and tensor data) produced by an inference execution. The response cache +is used by Triton to hold inference results generated for previous executed +inference requests. Triton will maintain the response cache so that inference +requests that hit in the cache will not need to execute a model to produce +results and will instead extract their results from the cache. For some use +cases this can significantly reduce the inference request latency. + +Triton accesses the response cache with a hash of the inference request that +includes the model name, model version and model inputs. If the hash is found in +the cache, the corresponding inference result is extracted from the cache and +used for the request. When this happens there is no need for Triton to execute +the model to produce the inference result. If the hash is not found in the +cache, Triton executes the model to produce the inference result, and then +records that result in the cache so that subsequent inference requests can +(re)use those results. + +## Usage + +In order for caching to be used on a given model, it must be enabled +on both the server-side, and in the model's +[model config](model_configuration.md#response-cache). See the following +sections below for more details. + +### Enable Caching on Server-side + +The response cache is enabled on the server-side by specifying a +`` and corresponding configuration when starting +the Triton server. + +Through the CLI, this translates to setting +`tritonserver --cache-config ,= ...`. For example: +``` +tritonserver --cache-config local,size=1048576 +``` + +For in-process C API applications, this translates to calling +`TRITONSERVER_SetCacheConfig(const char* cache_implementation, const char* config_json)`. 
+ +This allows users to enable/disable caching globally on server startup. + +### Enable Caching for a Model + +**By default, no model uses response caching even if the response cache +is enabled globally with the `--cache-config` flag.** + +For a given model to use response caching, the model must also have +response caching enabled in its model configuration: +``` +# config.pbtxt + +response_cache { + enable: true +} +``` + +This allows users to enable/disable caching for specific models. + +For more information on enabling the response cache for each model, see the +[model configuration docs](model_configuration.md#response-cache). + +### Cache Implementations + +Starting in the 23.03 release, Triton has a set of +[TRITONCACHE APIs](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritoncache.h) +that are used to communicate with a cache implementation of the user's choice. + +A cache implementation is a shared library that implements the required +TRITONCACHE APIs and is dynamically loaded on server startup, if enabled. + +Triton's most recent +[tritonserver release containers](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) +come with the following cache implementations out of the box: +- [local](https://github.com/triton-inference-server/local_cache): `/opt/tritonserver/caches/local/libtritoncache_local.so` +- [redis](https://github.com/triton-inference-server/redis_cache): `/opt/tritonserver/caches/redis/libtritoncache_redis.so` + +With these TRITONCACHE APIs, `tritonserver` exposes a new `--cache-config` +CLI flag that gives the user flexible customization of which cache implementation +to use, and how to configure it. Similar to the `--backend-config` flag, +the expected format is `--cache-config ,=` and may +be specified multiple times to specify multiple keys if the cache implementation +requires it. + +#### Local Cache + +The `local` cache implementation is equivalent to the response cache used +internally before the 23.03 release. For more implementation specific details, +see the +[local cache implementation](https://github.com/triton-inference-server/local_cache). + +When `--cache-config local,size=SIZE` is specified with a non-zero `SIZE`, +Triton allocates the requested size in CPU memory and **shares the +cache across all inference requests and across all models**. + +#### Redis Cache + +The `redis` cache implementation exposes the ability for Triton to communicate +with a Redis server for caching. The `redis_cache` implementation is essentially +a Redis client that acts as an intermediary between Triton and Redis. + +To list a few benefits of the `redis` cache compared to the `local` cache in +the context of Triton: +- The Redis server can be hosted remotely as long as it is accessible by Triton, + so it is not tied directly to the Triton process lifetime. + - This means Triton can be restarted and still have access to previously cached entries. + - This also means that Triton doesn't have to compete with the cache for memory/resource usage. +- Multiple Triton instances can share a cache by configuring each Triton instance + to communicate with the same Redis server. +- The Redis server can be updated/restarted independently of Triton, and + Triton will fallback to operating as it would with no cache access during + any Redis server downtime, and log appropriate errors. 
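+
+As a hedged example, a Triton launch that points the `redis` cache at a running
+Redis server might look like the following. The `host` and `port` key names are
+taken from the redis cache implementation linked above and the hostname is a
+placeholder; consult that implementation's documentation for the authoritative
+list of supported keys.
+
+```
+tritonserver --model-repository=/models \
+    --cache-config redis,host=redis-cache.example.com \
+    --cache-config redis,port=6379
+```
+
+Each additional `--cache-config redis,<key>=<value>` pair is forwarded to the
+cache implementation, so any further connection or tuning settings it supports
+can be supplied in the same way.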
+ +In general, the Redis server can be configured/deployed as needed for your use +case, and Triton's `redis` cache will simply act as a client of your Redis +deployment. The [Redis docs](https://redis.io/docs/) should be consulted for +questions and details about configuring the Redis server. + +For Triton-specific `redis` cache implementation details/configuration, see the +[redis cache implementation](https://github.com/triton-inference-server/redis_cache). + +#### Custom Cache + +With the TRITONCACHE API interface, it is now possible for +users to implement their own cache to suit any use-case specific needs. +To see the required interface that must be implemented by a cache +developer, see the +[TRITONCACHE API header](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritoncache.h). +The `local` or `redis` cache implementations may be used as reference. + +Upon successfully developing and building a custom cache, the resulting shared +library (ex: `libtritoncache_.so`) must be placed in the cache directory +similar to where the `local` and `redis` cache implementations live. By default, +this directory is `/opt/tritonserver/caches`, but a custom directory may be +specified with `--cache-dir` as needed. + +To put this example together, if the custom cache were named "custom" +(this name is arbitrary), by default Triton would expect to find the +cache implementation at `/opt/tritonserver/caches/custom/libtritoncache_custom.so`. + +## Deprecation Notes + +> **Note** +> Prior to 23.03, enabling the `local` cache used to be done through setting a non-zero size +> (in bytes) when Triton was launched using the `--response-cache-byte-size` flag. +> +> Starting in 23.03, the `--response-cache-byte-size` flag is now deprecated and +> `--cache-config` should be used instead. For backwards compatibility, +> `--response-cache-byte-size` will continue to function under the hood by being +> converted to the corresponding `--cache-config` argument, but it will default +> to using the `local` cache implementation. It is not possible to choose other +> cache implementations using the `--response-cache-byte-size` flag. +> +> For example, `--response-cache-byte-size 1048576` +> would be equivalent to `--cache-config local,size=1048576`. However, the +> `--cache-config` flag is much more flexible and should be used instead. + +> **Warning** +> +> The `local` cache implementation may fail to initialize for very small values +> of `--cache-config local,size=` or `--response-cache-byte-size` +> (ex: less than 1024 bytes) due to internal memory management requirements. +> If you encounter an initialization error for a relatively small cache size, +> try increasing it. +> +> Similarly, the size is upper bounded by the available RAM on the system. +> If you encounter an initial allocation error for a very large cache size +> setting, try decreasing it. + +## Performance + +The response cache is intended to be used for use cases where a significant +number of duplicate requests (cache hits) are expected and therefore would +benefit from caching. The term "significant" here is subjective to the use +case, but a simple interpretation would be to consider the proportion of +expected cache hits/misses, as well as the average time spend computing +a response. + +For cases where cache hits are common and computation is expensive, +the cache can significantly improve overall performance. 
+ +For cases where most requests are unique (cache misses) or the compute is +fast/cheap (the model is not compute-bound), the cache can negatively impact +the overall performance due to the overhead of managing and communicating with +the cache. + +## Known Limitations + +- Only input tensors located in CPU memory will be hashable for accessing the + cache. If an inference request contains input tensors not in CPU memory, the + request will not be hashed and therefore the response will not be cached. +- Only responses with all output tensors located in CPU memory will be eligible + for caching. If any output tensor in a response is not located in CPU memory, + the response will not be cached. +- The cache is accessed using only the inference request hash. As a result, if + two different inference requests generate the same hash (a hash collision), + then Triton may incorrectly use the cached result for an inference request. + The hash is a 64-bit value so the likelihood of collision is small. +- Only successful inference requests will have their responses cached. If a + request fails or returns an error during inference, its response will not be + cached. +- Only requests going through the Default Scheduler or Dynamic Batch Scheduler + are eligible for caching. The Sequence Batcher does not currently support + response caching. +- The response cache does not currently support + [decoupled models](decoupled_models.md). +- Top-level requests to ensemble models do not currently support response + caching. However, composing models within an ensemble may have their + responses cached if supported and enabled by that composing model. + diff --git a/docs/user_guide/trace.md b/docs/user_guide/trace.md new file mode 100644 index 0000000000..23d1c402d1 --- /dev/null +++ b/docs/user_guide/trace.md @@ -0,0 +1,539 @@ + + +# Triton Server Trace + +Triton includes that capability to generate a detailed trace for +individual inference requests. Tracing is enable by command-line +arguments when running the tritonserver executable. + +`--trace-config` command line option in Triton can be used to specify +global and trace mode specific config setting. The format of this flag +is `--trace-config ,=`, where `` +is either `triton` or `opentelemetry`. By default, the trace mode is set to `triton`, +and the server will use Triton's trace APIs. For `opentelemetry` mode, +the server will use the [OpenTelemetry's APIs](#opentelemetry-trace-support) to generate, +collect and export traces for individual inference requests. + +To specify global trace settings (level, rate, count, or mode), +the format is `--trace-config =`. + +An example usage, which invokes Triton's trace APIs: + +``` +$ tritonserver \ + --trace-config triton,file=/tmp/trace.json \ + --trace-config triton,log-frequency=50 \ + --trace-config rate=100 \ + --trace-config level=TIMESTAMPS \ + --trace-config count=100 ... +``` + +## Trace Settings +### Global Settings +The following table shows available global trace settings to pass to `--trace-config` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
SettingDefault ValueDescription
rate1000 + Specifies the sampling rate. The same as deprecated + --trace-rate.
+ For example, a value of 1000 specifies that every 1000-th inference
+ request will be traced. +
levelOFF + Indicates the level of trace detail that should be collected and
+ may be specified multiple times to trace multiple kinds of information.
+ The same as deprecated --trace-level.
+ Choices are TIMESTAMPS and TENSORS.
+ Note that opentelemetry mode does not currently
+ support TENSORS level. +
count-1 + Specifies the remaining number of traces to be collected.
+ The default value of -1 means traces are collected indefinitely.
+ With a value of 100, Triton will stop tracing requests
+ after 100 traces are collected.
+ The same as deprecated --trace-count. +
modetriton + Specifies which trace APIs to use for collecting traces.
+ The choices are triton or opentelemetry.
+
+ +### Triton Trace APIs Settings + +The following table shows available Triton trace APIs settings for +`--trace-config triton,=`. + + + + + + + + + + + + + + + + + + + + +
SettingDefault ValueDescription
fileempty string + Indicates where the trace output should be written.
+ The same as deprecated --trace-file.
+
log-frequency0 + Specifies the rate that the traces are written to file.
+ For example, a value of 50 specifies that Triton will log
+ to file for every 50 traces collected.
+ The same as deprecated --trace-log-frequency.
+
+ +In addition to the trace configuration settings in the command line, you can +modify the trace configuration using the [trace +protocol](../protocol/extension_trace.md). This option is currently not supported, +when trace mode is set to `opentelemetry`. + +**Note**: the following flags are **deprecated**: + +The `--trace-file` option indicates where the trace output should be +written. The `--trace-rate` option specifies the sampling rate. In +this example every 100-th inference request will be traced. The +`--trace-level` option indicates the level of trace detail that should +be collected. `--trace-level` option may be specified multiple times to +trace multiple information. The `--trace-log-frequency` option specifies the +rate that the traces are written to file. In this example Triton will log to +file for every 50 traces collected. The `--trace-count` option specifies the +remaining number of traces to be collected. In this example Triton will stop +tracing more requests after 100 traces are collected. Use the `--help` option +to get more information. + +## Supported Trace Level Option + +- `TIMESTAMPS`: Tracing execution timestamps of each request. +- `TENSORS`: Tracing input and output tensors during the execution. + +## JSON Trace Output + +The trace output is a JSON file with the following schema. + +``` +[ + { + "model_name": $string, + "model_version": $number, + "id": $number, + "request_id": $string, + "parent_id": $number + }, + { + "id": $number, + "timestamps": [ + { "name" : $string, "ns" : $number } + ] + }, + { + "id": $number + "activity": $string, + "tensor":{ + "name": $string, + "data": $string, + "shape": $string, + "dtype": $string + } + }, + ... +] +``` + +Each trace is assigned a "id", which indicates the model name and +version of the inference request. If the trace is from a +model run as part of an ensemble, the "parent_id" will indicate the +"id" of the containing ensemble. +For example: +``` +[ + { + "id": 1, + "model_name": "simple", + "model_version": 1 + }, + ... +] +``` + +Each `TIMESTAMPS` trace will have one or more "timestamps" with +each timestamp having a name and the timestamp in nanoseconds ("ns"). +For example: + +``` +[ + {"id": 1, "timestamps": [{ "name": "HTTP_RECV_START", "ns": 2356425054587444 }] }, + {"id": 1, "timestamps": [{ "name": "HTTP_RECV_END", "ns": 2356425054632308 }] }, + {"id": 1, "timestamps": [{ "name": "REQUEST_START", "ns": 2356425054785863 }] }, + {"id": 1, "timestamps": [{ "name": "QUEUE_START", "ns": 2356425054791517 }] }, + {"id": 1, "timestamps": [{ "name": "INFER_RESPONSE_COMPLETE", "ns": 2356425057587919 }] }, + {"id": 1, "timestamps": [{ "name": "COMPUTE_START", "ns": 2356425054887198 }] }, + {"id": 1, "timestamps": [{ "name": "COMPUTE_INPUT_END", "ns": 2356425057152908 }] }, + {"id": 1, "timestamps": [{ "name": "COMPUTE_OUTPUT_START", "ns": 2356425057497763 }] }, + {"id": 1, "timestamps": [{ "name": "COMPUTE_END", "ns": 2356425057540989 }] }, + {"id": 1, "timestamps": [{ "name": "REQUEST_END", "ns": 2356425057643164 }] }, + {"id": 1, "timestamps": [{ "name": "HTTP_SEND_START", "ns": 2356425057681578 }] }, + {"id": 1, "timestamps": [{ "name": "HTTP_SEND_END", "ns": 2356425057712991 }] } +] +``` + +Each `TENSORS` trace will contain an "activity" and a "tensor". +"activity" indicates the type of tensor, including "TENSOR_QUEUE_INPUT" +and "TENSOR_BACKEND_OUTPUT" by now. "tensor" has the detail of tensor, +including its "name", "data" and "dtype". 
For example: + +``` +[ + { + "id": 1, + "activity": "TENSOR_QUEUE_INPUT", + "tensor":{ + "name": "input", + "data": "0.1,0.1,0.1,...", + "shape": "1,16", + "dtype": "FP32" + } + } +] +``` + +## Trace Summary Tool + +An example [trace summary tool](https://github.com/triton-inference-server/server/blob/main/qa/common/trace_summary.py) can be +used to summarize a set of traces collected from Triton. Basic usage +is: + +``` +$ trace_summary.py +``` + +This produces a summary report for all traces in the file. HTTP and +GRPC inference requests are reported separately. + +``` +File: trace.json +Summary for simple (-1): trace count = 1 +HTTP infer request (avg): 378us + Receive (avg): 21us + Send (avg): 7us + Overhead (avg): 79us + Handler (avg): 269us + Overhead (avg): 11us + Queue (avg): 15us + Compute (avg): 242us + Input (avg): 18us + Infer (avg): 208us + Output (avg): 15us +Summary for simple (-1): trace count = 1 +GRPC infer request (avg): 21441us + Wait/Read (avg): 20923us + Send (avg): 74us + Overhead (avg): 46us + Handler (avg): 395us + Overhead (avg): 16us + Queue (avg): 47us + Compute (avg): 331us + Input (avg): 30us + Infer (avg): 286us + Output (avg): 14us +``` + +Use the -t option to get a summary for each trace in the file. This +summary shows the time, in microseconds, between different points in +the processing of an inference request. For example, the below output +shows that it took 15us from the start of handling the request until +the request was enqueued in the scheduling queue. + +``` +$ trace_summary.py -t +... +simple (-1): + grpc wait/read start + 26529us + grpc wait/read end + 39us + request handler start + 15us + queue start + 20us + compute start + 266us + compute end + 4us + request handler end + 19us + grpc send start + 77us + grpc send end +... +``` + +The script can also show the data flow of the first request if there are +`TENSORS` traces in the file. If the `TENSORS` traces are from an ensemble, +the data flow will be shown with the dependency of each model. + +``` +... +Data Flow: + ========================================================== + Name: ensemble + Version:1 + QUEUE_INPUT: + input: [[0.705676 0.830855 0.833153]] + BACKEND_OUTPUT: + output: [[1. 2. 7. 0. 4. 7. 9. 3. 4. 9.]] + ========================================================== + ================================================== + Name: test_trt1 + Version:1 + QUEUE_INPUT: + input: [[0.705676 0.830855 0.833153]] + BACKEND_OUTPUT: + output1: [[1. 1. ...]] + ================================================== + ================================================== + Name: test_trt2 + Version:1 + QUEUE_INPUT: + input: [[0.705676 0.830855 0.833153]] + BACKEND_OUTPUT: + output2: [[2. 2. ...]] + ================================================== + ================================================== + Name: test_py + Version:1 + QUEUE_INPUT: + output1: [[1. 1. ...]] + QUEUE_INPUT: + output2: [[2. 2. ...]] + BACKEND_OUTPUT: + output: [[1. 2. 7. 0. 4. 7. 9. 3. 4. 9.]] + ================================================== +... +``` + +The meaning of the trace timestamps is: + +* GRPC Request Wait/Read: Collected only for inference requests that use the + GRPC protocol. The time spent waiting for a request to arrive at the + server and for that request to be read. Because wait time is + included in the time it is not a useful measure of how much time is + spent reading a request from the network. Tracing an HTTP request + will provide an accurate measure of the read time. 
+ +* HTTP Request Receive: Collected only for inference requests that use the + HTTP protocol. The time required to read the inference request from + the network. + +* Send: The time required to send the inference response. + +* Overhead: Additional time required in the HTTP or GRPC endpoint to + process the inference request and response. + +* Handler: The total time spent handling the inference request, not + including the HTTP and GRPC request/response handling. + + * Queue: The time the inference request spent in the scheduling queue. + + * Compute: The time the inference request spent executing the actual + inference. This time includes the time spent copying input and + output tensors. If --trace-level=TIMESTAMPS then a breakdown of the + compute time will be provided as follows: + + * Input: The time to copy input tensor data as required by the + inference framework / backend. This includes the time to copy + input tensor data to the GPU. + + * Infer: The time spent executing the model to perform the + inference. + + * Output: The time to copy output tensor data as required by the + inference framework / backend. This includes the time to copy + output tensor data from the GPU. + + * Overhead: Additional time required for request handling not + covered by Queue or Compute times. + +* Data Flow: The data flow of the first request. It contains the input and + output tensors of each part of execution. + + * Name: The name of model. + + * Version: The version of model. + + * QUEUE_INPUT: The tensor entering the queue of a backend to wait for + scheduling. + + * BACKEND_OUTPUT: The tensor in the response of a backend. + +## Tracing for BLS models + +Triton does not collect traces for child models invoked from +[BLS](https://github.com/triton-inference-server/python_backend/tree/main#business-logic-scripting) +models by default. + +To include child models into collected traces, user needs to provide the `trace` +argument (as shown in the example below), when constructing an InferenceRequest object. +This helps Triton associate the child model with the parent model's trace (`request.trace()`). + +```python + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + ... + def execute(self, requests): + ... + for request in requests: + ... + inference_request = pb_utils.InferenceRequest( + model_name='model_name', + requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'], + inputs=[], trace = request.trace()) + +``` + +## OpenTelemetry trace support + +Triton provides an option to generate and export traces using +[OpenTelemetry APIs and SDKs](https://opentelemetry.io/). + +To specify OpenTelemetry mode for tracing, specify the `--trace-config` +flag as follows: + +``` +$ tritonserver --trace-config mode=opentelemetry \ + --trace-config opentelemetry,url= ... +``` +### Differences in trace contents from Triton's trace [output](#json-trace-output) + +OpenTelemetry APIs produce [spans](https://opentelemetry.io/docs/concepts/observability-primer/#spans) +that collect the same timestamps as Triton's Trace +APIs. Each span also includes `model_name`, `model_version`, `request_id`, +and `parent_id` as an [attribute](https://opentelemetry.io/docs/concepts/observability-primer/#span-attributes). + +The span collects `TIMESTAMPS` that consist of a name and a timestamp +in nanoseconds, which is similar to Triton Trace APIs. However, +OpenTelemetry relies on the system's clock for event timestamps, which is based +on the system's real-time clock. 
On the other hand, Triton Trace APIs +report timestamps using steady clock, which is a monotonic clock that ensures +time always movess forward. This clock is not related to wall clock time +and, for example, can measure time since last reboot. + + +### OpenTelemetry trace APIs settings + +The following table shows available OpenTelemetry trace APIs settings for +`--trace-config opentelemetry,=`. + + + + + + + + + + + + + + + + + + + + +
SettingDefault ValueDescription
urlhttp://localhost:4318/v1/traces + host:port to which the receiver is going to receive + trace data. +
resourceservice.name=triton-inference-server + Key-value pairs to be used as resource attributes.
+ Should be specified following the provided template:
+ --trace-config opentelemetry,resource=<key>=<value>
+ For example:
+ --trace-config opentelemetry,resource=service.name=triton
+ --trace-config opentelemetry,resource=service.version=1
+ Alternatively, key-value attributes can be specified through
+ + OTEL_RESOURCE_ATTRIBUTES + environment variable. +
+ + +### Limitations + +- OpenTelemetry trace mode is not supported on Windows systems. + +- Triton supports only +[OTLP/HTTP Exporter](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#otlphttp) +and allows specification of only url for this exporter through +`--trace-config`. Other options and corresponding default values can be +found [here](https://github.com/open-telemetry/opentelemetry-cpp/tree/v1.8.3/exporters/otlp#configuration-options--otlp-http-exporter-). + +- Triton does not support configuration of the opentelemetry trace settings +during a Triton run and opentelemetry specific settings are not available +for the retrieval through [Triton's trace extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_trace.md). \ No newline at end of file diff --git a/docs/v1_to_v2.md b/docs/user_guide/v1_to_v2.md similarity index 95% rename from docs/v1_to_v2.md rename to docs/user_guide/v1_to_v2.md index ed01313b34..d9da6f6cf8 100644 --- a/docs/v1_to_v2.md +++ b/docs/user_guide/v1_to_v2.md @@ -51,7 +51,7 @@ version 2. * The HTTP/REST and GRPC protocols, while conceptually similar to version 1, are completely changed in version 2. See [inference - protocols](inference_protocols.md) for more information. + protocols](../customization_guide/inference_protocols.md) for more information. * Python and C++ client libraries are re-implemented to match the new HTTP/REST and GRPC protocols. The Python client no longer depends on @@ -61,7 +61,7 @@ version 2. more information. * Building Triton has changed significantly in version 2. See - [build](build.md) for more information. + [build](../customization_guide/build.md) for more information. * In the Docker containers the environment variables indicating the Triton version have changed to have a TRITON prefix, for example, diff --git a/pyproject.toml b/pyproject.toml new file mode 100644 index 0000000000..2843ad2d42 --- /dev/null +++ b/pyproject.toml @@ -0,0 +1,51 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +[tool.codespell] +# note: pre-commit passes explicit lists of files here, which this skip file list doesn't override - +# this is only to allow you to run codespell interactively +skip = "./.git,./.github" +# ignore short words, and typename parameters like OffsetT +ignore-regex = "\\b(.{1,4}|[A-Z]\\w*T)\\b" +# ignore allowed words +ignore-words-list = "passin" +# use the 'clear' dictionary for unambiguous spelling mistakes +builtin = "clear" +# disable warnings about binary files and wrong encoding +quiet-level = 3 + +[tool.isort] +profile = "black" +use_parentheses = true +multi_line_output = 3 +include_trailing_comma = true +force_grid_wrap = 0 +ensure_newline_before_comments = true +line_length = 88 +balanced_wrapping = true +indent = " " +skip = ["build"] + diff --git a/qa/L0_async_work_queue/test.sh b/qa/L0_async_work_queue/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_backend_bls/test.sh b/qa/L0_backend_bls/test.sh index 505d572608..f2193ee801 100755 --- a/qa/L0_backend_bls/test.sh +++ b/qa/L0_backend_bls/test.sh @@ -37,13 +37,14 @@ source ../common/util.sh RET=0 # Backend build requires recent version of CMake (FetchContent required) -wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \ - gpg --dearmor - | \ - tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \ - apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' && \ - apt-get update && \ - apt-get install -y --no-install-recommends \ - cmake-data=3.21.1-0kitware1ubuntu20.04.1 cmake=3.21.1-0kitware1ubuntu20.04.1 \ +# Using CMAKE installation instruction from:: https://apt.kitware.com/ +apt update -q=2 \ + && apt install -y gpg wget \ + && wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null \ + && . /etc/os-release \ + && echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null \ + && apt-get update -q=2 \ + && apt-get install -y --no-install-recommends cmake=3.27.7* cmake-data=3.27.7* \ rapidjson-dev cmake --version diff --git a/qa/L0_backend_config/test.sh b/qa/L0_backend_config/test.sh old mode 100644 new mode 100755 index 3bd7890ceb..b898735798 --- a/qa/L0_backend_config/test.sh +++ b/qa/L0_backend_config/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -66,7 +66,7 @@ POSITIVE_TEST_ARGS=("--backend-config=tensorflow,default-max-batch-size=5 $COMMO "--backend-config=default-max-batch-size=7 --backend-config=tensorflow,default-max-batch-size=8 $COMMON_ARGS" \ ) -# These integers correspond to the expected default-max-batch-size which gets set +# These integers correspond to the expected default-max-batch-size which gets set # in the POSITIVE_TEST_ARGS POSITIVE_TEST_ANSWERS=(5 6 8) @@ -86,12 +86,12 @@ else RESULT_LOG_LINE=$(grep -a "Adding default backend config setting:" $SERVER_LOG) if [ "$RESULT_LOG_LINE" != "" ]; then - + # Pick out the logged value of the default-max-batch-size which gets passed into model creation RESOLVED_DEFAULT_MAX_BATCH_SIZE=$(awk -v line="$RESULT_LOG_LINE" 'BEGIN {split(line, a, "]"); split(a[2], b, ": "); split(b[2], c, ","); print c[2]}') if [ "$RESOLVED_DEFAULT_MAX_BATCH_SIZE" != "4" ]; then - echo "*** FAILED: Found default-max-batch-size not equal to the expected default-max-batch-size. Expected: default-max-batch-size,4, Found: $RESOLVED_DEFAULT_MAX_BATCH_SIZE \n" + echo "*** FAILED: Found default-max-batch-size not equal to the expected default-max-batch-size. Expected: default-max-batch-size,4, Found: $RESOLVED_DEFAULT_MAX_BATCH_SIZE \n" RET=1 fi else @@ -104,7 +104,7 @@ for ((i=0; i < ${#POSITIVE_TEST_ARGS[@]}; i++)); do SERVER_ARGS=${POSITIVE_TEST_ARGS[$i]} SERVER_LOG=$SERVER_LOG_BASE.backend_config_positive_$i.log run_server - + if [ "$SERVER_PID" == "0" ]; then echo -e "*** FAILED: Server failed to start $SERVER\n" RET=1 @@ -115,12 +115,12 @@ for ((i=0; i < ${#POSITIVE_TEST_ARGS[@]}; i++)); do RESULT_LOG_LINE=$(grep -a "Found overwritten default setting:" $SERVER_LOG) if [ "$RESULT_LOG_LINE" != "" ]; then - + # Pick out the logged value of the default-max-batch-size which gets passed into model creation RESOLVED_DEFAULT_MAX_BATCH_SIZE=$(awk -v line="$RESULT_LOG_LINE" 'BEGIN {split(line, a, "]"); split(a[2], b, ": "); split(b[2], c, ","); print c[2]}') if [ "$RESOLVED_DEFAULT_MAX_BATCH_SIZE" != "${POSITIVE_TEST_ANSWERS[$i]}" ]; then - echo "*** FAILED: Found default-max-batch-size not equal to the expected default-max-batch-size. Expected: ${POSITIVE_TEST_ANSWERS[$i]}, Found: $RESOLVED_DEFAULT_MAX_BATCH_SIZE \n" + echo "*** FAILED: Found default-max-batch-size not equal to the expected default-max-batch-size. Expected: ${POSITIVE_TEST_ANSWERS[$i]}, Found: $RESOLVED_DEFAULT_MAX_BATCH_SIZE \n" RET=1 fi else @@ -152,11 +152,11 @@ done # -# Sepcific backend tests -# +# Specific backend tests +# -# While inference server is running, save the -# config of the 'no_config' model to the TRIAL +# While inference server is running, save the +# config of the 'no_config' model to the TRIAL # file. 
function save_model_config() { CODE=`curl -s -w %{http_code} -o ./$TRIAL.out localhost:8000/v2/models/no_config/config` @@ -192,13 +192,13 @@ else RET=1 fi - # Assert we are also turning on the dynamic_batcher + # Assert we are also turning on the dynamic_batcher DYNAMIC_BATCHING_LOG_LINE=$(grep -a "Starting dynamic-batcher thread" $SERVER_LOG) if [ "$DYNAMIC_BATCHING_LOG_LINE" == "" ]; then echo "*** FAILED: Expected dynamic batching to be set in model config but was not found\n" RET=1 fi - + kill $SERVER_PID wait $SERVER_PID @@ -225,7 +225,7 @@ else RET=1 fi - # Assert batching disabled + # Assert batching disabled if [ "$(grep -a -E '\"dynamic_batching\": \{}' $SERVER_LOG)" != "" ]; then echo "*** FAILED: Found dynamic batching enabled in configuration when none expected.\n" RET=1 @@ -252,7 +252,7 @@ if [ "$SERVER_PID" == "0" ]; then else save_model_config - + # Assert the max-batch-size is the command line value MAX_BATCH_LOG_LINE=$(grep -a "\"max_batch_size\":5" $TRIAL.out) if [ "$MAX_BATCH_LOG_LINE" == "" ]; then @@ -260,13 +260,13 @@ else RET=1 fi - # Assert we are also turning on the dynamic_batcher + # Assert we are also turning on the dynamic_batcher DYNAMIC_BATCHING_LOG_LINE=$(grep -a "Starting dynamic-batcher thread" $SERVER_LOG) if [ "$DYNAMIC_BATCHING_LOG_LINE" == "" ]; then echo "*** FAILED: Expected dynamic batching to be set in model config but was not found\n" RET=1 fi - + kill $SERVER_PID wait $SERVER_PID fi @@ -296,7 +296,7 @@ else RET=1 fi - # Assert batching disabled + # Assert batching disabled if [ "$(grep -a -E '\"dynamic_batching\": \{}' $SERVER_LOG)" != "" ]; then echo "*** FAILED: Found dynamic batching in configuration when none expected.\n" RET=1 @@ -307,6 +307,97 @@ else fi +# +# General backend tests +# + +# We want to make sure that backend configurations +# are not lost. For this purpose we are using only onnx backend + +rm -rf ./models/ +mkdir -p ./models/no_config/ +cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/onnx_float32_float32_float32/1 ./models/no_config/ + +# First getting a baseline for the number of default configs +# added during a server set up +SERVER_ARGS="$COMMON_ARGS" +SERVER_LOG=$SERVER_LOG_BASE.default_configs.log +run_server + +if [ "$SERVER_PID" == "0" ]; then + echo -e "*** FAILED: Server failed to start $SERVER\n" + RET=1 + +else + # Count number of default configs + BACKEND_CONFIG_MAP=$(grep -a "backend configuration:" $SERVER_LOG -A 1 | grep -v "backend configuration") + DEFAULT_CONFIG_COUNT=$(echo $BACKEND_CONFIG_MAP | jq -r | jq '.["cmdline"]' | jq 'length') + if [ $DEFAULT_CONFIG_COUNT -lt 4 ]; then + echo "*** FAILED: Expected number of default configs to be at least 4 but found: $DEFAULT_CONFIG_COUNT\n" + RET=1 + fi + + kill $SERVER_PID + wait $SERVER_PID + +fi + +# Now make sure that when setting specific backend configs +# default ones are not lost. +# Current logic for backend config resolution reads default configs first, +# then specific configs and overrides defaults if needed. +# We would like to make sure that none of configs are lost and +# defaults are properly overridden. +# One of defaultconfigs is `min-compute-capability`. This test +# checks if it is properlly overridden. 
+MIN_COMPUTE_CAPABILITY=XX +SERVER_ARGS="--backend-config=onnxruntime,min-compute-capability=$MIN_COMPUTE_CAPABILITY $COMMON_ARGS" +SERVER_LOG=$SERVER_LOG_BASE.global_configs.log +run_server + +if [ "$SERVER_PID" == "0" ]; then + echo -e "*** FAILED: Server failed to start $SERVER\n" + RET=1 + +else + # Count number of default configs + BACKEND_CONFIG_MAP=$(grep -a "backend configuration:" $SERVER_LOG -A 1 | grep -v "backend configuration") + CONFIG_VALUE=$(echo $BACKEND_CONFIG_MAP | jq -r | jq '.["cmdline"]' | jq -r '.["min-compute-capability"]') + + if [ $CONFIG_VALUE != $MIN_COMPUTE_CAPABILITY ]; then + echo "*** FAILED: Expected min-compute-capability config to be $MIN_COMPUTE_CAPABILITY but found: $CONFIG_VALUE\n" + RET=1 + fi + + kill $SERVER_PID + wait $SERVER_PID + +fi +# Now make sure that specific backend configs are not lost. +SERVER_ARGS="--backend-config=onnxruntime,a=0 --backend-config=onnxruntime,y=0 --backend-config=onnxruntime,z=0 $COMMON_ARGS" +SERVER_LOG=$SERVER_LOG_BASE.specific_configs.log +EXPECTED_CONFIG_COUNT=$(($DEFAULT_CONFIG_COUNT+3)) +run_server + +if [ "$SERVER_PID" == "0" ]; then + echo -e "*** FAILED: Server failed to start $SERVER\n" + RET=1 + +else + # Count number of default configs + BACKEND_CONFIG_MAP=$(grep -a "backend configuration:" $SERVER_LOG -A 1 | grep -v "backend configuration") + TOTAL_CONFIG_COUNT=$(echo $BACKEND_CONFIG_MAP | jq -r | jq '.["cmdline"]' | jq 'length') + + if [ $TOTAL_CONFIG_COUNT -ne $EXPECTED_CONFIG_COUNT ]; then + echo "*** FAILED: Expected number of backend configs to be $EXPECTED_CONFIG_COUNT but found: $TOTAL_CONFIG_COUNT\n" + RET=1 + fi + + kill $SERVER_PID + wait $SERVER_PID + +fi + # Print test outcome if [ $RET -eq 0 ]; then diff --git a/qa/L0_backend_fastertransformer/test.sh b/qa/L0_backend_fastertransformer/test.sh new file mode 100755 index 0000000000..8e5d20271a --- /dev/null +++ b/qa/L0_backend_fastertransformer/test.sh @@ -0,0 +1,83 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+FASTERTRANSFORMER_BRANCH_TAG=${FASTERTRANSFORMER_BRANCH_TAG:="main"} +FASTERTRANSFORMER_BRANCH=${FASTERTRANSFORMER_BRANCH:="https://github.com/triton-inference-server/fastertransformer_backend.git"} +SERVER_TIMEOUT=600 +SERVER_LOG="$PWD/inference_server" +CLIENT_LOG="$PWD/client" + +MODEL_DIR=${MODEL_DIR:=$PWD/fastertransformer_backend/all_models/t5/} +TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} +SERVER=${TRITON_DIR}/bin/tritonserver +BACKEND_DIR=${TRITON_DIR}/backends +SERVER_ARGS_EXTRA="--exit-timeout-secs=${SERVER_TIMEOUT} --backend-directory=${BACKEND_DIR}" +SERVER_ARGS="--model-repository=${MODEL_DIR} ${SERVER_ARGS_EXTRA}" +source ../common/util.sh + +rm -f $SERVER_LOG* $CLIENT_LOG* + +RET=0 +# install dependencies +apt-get update && \ + apt-get install -y --no-install-recommends python3 python3-pip python3-protobuf +python3 -m pip install --upgrade pip && \ + pip3 install --upgrade numpy + +# install client libraries +pip3 install tritonclient[all] + +# Clone repo +git clone --single-branch --depth=1 -b ${FASTERTRANSFORMER_BRANCH_TAG} ${FASTERTRANSFORMER_BRANCH} +cd fastertransformer_backend + +run_server + +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e + +python3 tools/issue_request.py tools/requests/sample_request_single_t5.json >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + RET=1 +fi + +kill_server + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + cat $SERVER_LOG + cat $CLIENT_LOG + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_backend_identity/identity_test.py b/qa/L0_backend_identity/identity_test.py old mode 100644 new mode 100755 index 009576aa34..ef0634b95c --- a/qa/L0_backend_identity/identity_test.py +++ b/qa/L0_backend_identity/identity_test.py @@ -27,74 +27,45 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse -import numpy as np import sys -import requests as httpreq from builtins import range + +import numpy as np +import requests as httpreq import tritongrpcclient as grpcclient import tritonhttpclient as httpclient from tritonclientutils import np_to_triton_dtype FLAGS = None -def test_bf16_raw_http(shape): - model = "identity_bf16" - # Using fp16 data as a WAR since it is same byte_size as bf16 - # and is supported by numpy for ease-of-use. 
Since this is an - # identity model, it's OK that the bytes are interpreted differently - input_data = (16384 * np.random.randn(*shape)).astype(np.float16) - input_bytes = input_data.tobytes() - headers = {'Inference-Header-Content-Length': '0'} - r = httpreq.post("http://localhost:8000/v2/models/{}/infer".format(model), - data=input_bytes, - headers=headers) - r.raise_for_status() - - # Get the inference header size so we can locate the output binary data - header_size = int(r.headers["Inference-Header-Content-Length"]) - output_bytes = r.content[header_size:] - # Sanity check output on pass - print("Response content:", r.content) - print("Input Bytes:", input_bytes) - print("Output Bytes:", output_bytes) - - # Assert correct output datatype - import json - response_json = json.loads(r.content[:header_size].decode("utf-8")) - assert(response_json["outputs"][0]["datatype"] == "BF16") - - # Assert equality of input/output for identity model - if not np.array_equal(output_bytes, input_bytes): - print("error: Expected response body contains correct output binary " \ - "data: {}; got: {}".format(input_bytes, output_bytes)) - sys.exit(1) - -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-v', - '--verbose', - action="store_true", - required=False, - default=False, - help='Enable verbose output') - parser.add_argument('-u', - '--url', - type=str, - required=False, - help='Inference server URL.') parser.add_argument( - '-i', - '--protocol', + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-u", "--url", type=str, required=False, help="Inference server URL." + ) + parser.add_argument( + "-i", + "--protocol", type=str, required=False, - default='http', - help='Protocol ("http"/"grpc") used to ' + - 'communicate with inference service. Default is "http".') + default="http", + help='Protocol ("http"/"grpc") used to ' + + 'communicate with inference service. 
Default is "http".', + ) FLAGS = parser.parse_args() if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"): - print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format( - FLAGS.protocol)) + print( + 'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol) + ) exit(1) client_util = httpclient if FLAGS.protocol == "http" else grpcclient @@ -109,17 +80,18 @@ def test_bf16_raw_http(shape): model_name = "identity_uint32" request_parallelism = 4 shape = [2, 2] - with client_util.InferenceServerClient(FLAGS.url, - concurrency=request_parallelism, - verbose=FLAGS.verbose) as client: + with client_util.InferenceServerClient( + FLAGS.url, concurrency=request_parallelism, verbose=FLAGS.verbose + ) as client: input_datas = [] requests = [] for i in range(request_parallelism): input_data = (16384 * np.random.randn(*shape)).astype(np.uint32) input_datas.append(input_data) inputs = [ - client_util.InferInput("INPUT0", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + client_util.InferInput( + "INPUT0", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) requests.append(client.async_infer(model_name, inputs)) @@ -136,32 +108,44 @@ def test_bf16_raw_http(shape): sys.exit(1) if not np.array_equal(output_data, input_datas[i]): - print("error: expected output {} to match input {}".format( - output_data, input_datas[i])) + print( + "error: expected output {} to match input {}".format( + output_data, input_datas[i] + ) + ) sys.exit(1) # Make sure the requests ran in parallel. stats = client.get_inference_statistics(model_name) - if (len(stats['model_stats']) != - 1) or (stats['model_stats'][0]['name'] != model_name): + if (len(stats["model_stats"]) != 1) or ( + stats["model_stats"][0]["name"] != model_name + ): print("error: expected statistics for {}".format(model_name)) sys.exit(1) - stat = stats['model_stats'][0] - if (stat['inference_count'] != 8) or (stat['execution_count'] != 1): + stat = stats["model_stats"][0] + if (stat["inference_count"] != 8) or (stat["execution_count"] != 1): print( - "error: expected execution_count == 1 and inference_count == 8, got {} and {}" - .format(stat['execution_count'], stat['inference_count'])) + "error: expected execution_count == 1 and inference_count == 8, got {} and {}".format( + stat["execution_count"], stat["inference_count"] + ) + ) sys.exit(1) # Check metrics to make sure they are reported correctly - metrics = httpreq.get('http://localhost:8002/metrics') + metrics = httpreq.get("http://localhost:8002/metrics") print(metrics.text) - success_str = 'nv_inference_request_success{model="identity_uint32",version="1"}' + success_str = ( + 'nv_inference_request_success{model="identity_uint32",version="1"}' + ) infer_count_str = 'nv_inference_count{model="identity_uint32",version="1"}' - infer_exec_str = 'nv_inference_exec_count{model="identity_uint32",version="1"}' - custom_metric_str = 'input_byte_size_counter{model="identity_uint32",version="1"}' + infer_exec_str = ( + 'nv_inference_exec_count{model="identity_uint32",version="1"}' + ) + custom_metric_str = ( + 'input_byte_size_counter{model="identity_uint32",version="1"}' + ) success_val = None infer_count_val = None @@ -169,55 +153,69 @@ def test_bf16_raw_http(shape): custom_metric_val = None for line in metrics.text.splitlines(): if line.startswith(success_str): - success_val = float(line[len(success_str):]) + success_val = float(line[len(success_str) :]) if line.startswith(infer_count_str): - infer_count_val = 
float(line[len(infer_count_str):]) + infer_count_val = float(line[len(infer_count_str) :]) if line.startswith(infer_exec_str): - infer_exec_val = float(line[len(infer_exec_str):]) + infer_exec_val = float(line[len(infer_exec_str) :]) if line.startswith(custom_metric_str): - custom_metric_val = float(line[len(custom_metric_str):]) + custom_metric_val = float(line[len(custom_metric_str) :]) if success_val != 4: - print("error: expected metric {} == 4, got {}".format( - success_str, success_val)) + print( + "error: expected metric {} == 4, got {}".format( + success_str, success_val + ) + ) sys.exit(1) if infer_count_val != 8: - print("error: expected metric {} == 8, got {}".format( - infer_count_str, infer_count_val)) + print( + "error: expected metric {} == 8, got {}".format( + infer_count_str, infer_count_val + ) + ) sys.exit(1) if infer_exec_val != 1: - print("error: expected metric {} == 1, got {}".format( - infer_exec_str, infer_exec_val)) + print( + "error: expected metric {} == 1, got {}".format( + infer_exec_str, infer_exec_val + ) + ) sys.exit(1) if custom_metric_val != 64: - print("error: expected metric {} == 64, got {}".format( - custom_metric_str, custom_metric_val)) + print( + "error: expected metric {} == 64, got {}".format( + custom_metric_str, custom_metric_val + ) + ) sys.exit(1) # Reuse a single client for all sync tests - with client_util.InferenceServerClient(FLAGS.url, - verbose=FLAGS.verbose) as client: + with client_util.InferenceServerClient(FLAGS.url, verbose=FLAGS.verbose) as client: for model_name, np_dtype, shape in ( - # yapf: disable + # yapf: disable ("identity_fp32", np.float32, [1, 0]), ("identity_fp32", np.float32, [1, 5]), ("identity_uint32", np.uint32, [4, 0]), ("identity_uint32", np.uint32, [8, 5]), ("identity_nobatch_int8", np.int8, [0]), ("identity_nobatch_int8", np.int8, [7]), - ("identity_bytes", object, [1, 1])): + ("identity_bytes", object, [1, 1]), + ("identity_bf16", np.float32, [1, 0]), + ("identity_bf16", np.float32, [1, 5]) + ): # yapf: enable if np_dtype != object: input_data = (16384 * np.random.randn(*shape)).astype(np_dtype) else: - in0 = (16384 * np.ones(shape, dtype='int')) - in0n = np.array([str(x) for x in in0.reshape(in0.size)], - dtype=object) + in0 = 16384 * np.ones(shape, dtype="int") + in0n = np.array([str(x) for x in in0.reshape(in0.size)], dtype=object) input_data = in0n.reshape(in0.shape) - inputs = [ - client_util.InferInput("INPUT0", input_data.shape, - np_to_triton_dtype(input_data.dtype)) - ] + if model_name != "identity_bf16": + triton_type = np_to_triton_dtype(input_data.dtype) + else: + triton_type = "BF16" + inputs = [client_util.InferInput("INPUT0", input_data.shape, triton_type)] inputs[0].set_data_from_numpy(input_data) results = client.infer(model_name, inputs) @@ -228,17 +226,48 @@ def test_bf16_raw_http(shape): if np_dtype == object: output_data = np.array( - [str(x, encoding='utf-8') for x in output_data.flatten()], - dtype=object).reshape(output_data.shape) + [str(x, encoding="utf-8") for x in output_data.flatten()], + dtype=object, + ).reshape(output_data.shape) if output_data is None: print("error: expected 'OUTPUT0'") sys.exit(1) - if not np.array_equal(output_data, input_data): - print("error: expected output {} to match input {}".format( - output_data, input_data)) - sys.exit(1) + if model_name == "identity_bf16": + if input_data.shape != output_data.shape: + print( + "error: expected output shape {} to match input shape {}".format( + output_data.shape, input_data.shape + ) + ) + sys.exit(1) + for input, 
output in zip( + np.nditer(input_data, flags=["refs_ok", "zerosize_ok"], order="C"), + np.nditer(output_data, flags=["refs_ok", "zerosize_ok"], order="C"), + ): + if input.tobytes()[2:4] != output.tobytes()[2:4]: + print( + "error: expected low-order bits of output {} to match low-order bits of input {}".format( + output, input + ) + ) + sys.exit(1) + if output.tobytes()[0:2] != b"\x00\x00": + print( + "error: expected output {} to have all-zero high-order bits, got {}".format( + output, output.tobytes()[0:2] + ) + ) + sys.exit(1) + else: + if not np.array_equal(output_data, input_data): + print( + "error: expected output {} to match input {}".format( + output_data, input_data + ) + ) + sys.exit(1) # Make sure response parameters are correct response = results.get_response() @@ -254,8 +283,7 @@ def test_bf16_raw_http(shape): param2 = params["param2"].bool_param if param0 != "an example string parameter": - print( - "error: expected 'param0' == 'an example string parameter'") + print("error: expected 'param0' == 'an example string parameter'") sys.exit(1) if param1 != 42: print("error: expected 'param1' == 42") @@ -263,8 +291,3 @@ def test_bf16_raw_http(shape): if param2 != False: print("error: expected 'param2' == False") sys.exit(1) - - # FIXME: Use identity_bf16 model in test above once proper python client - # support is added, and remove this raw HTTP test. See DLIS-3720. - test_bf16_raw_http([2, 2]) - diff --git a/qa/L0_backend_identity/test.sh b/qa/L0_backend_identity/test.sh index d49686493c..bd29951ba6 100755 --- a/qa/L0_backend_identity/test.sh +++ b/qa/L0_backend_identity/test.sh @@ -82,7 +82,7 @@ wait $SERVER_PID # Validate the byte_sizes reported by backend OLDIFS=$IFS; IFS=',' -for i in "byte_size = 0, 6", \ +for i in "byte_size = 0, 8", \ "byte_size = 7, 2", \ "byte_size = 16, 6", \ "byte_size = 20, 2", \ diff --git a/qa/L0_backend_output_detail/test.sh b/qa/L0_backend_output_detail/test.sh new file mode 100755 index 0000000000..a8f4de59d1 --- /dev/null +++ b/qa/L0_backend_output_detail/test.sh @@ -0,0 +1,69 @@ +#!/bin/bash +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "No Repo version detected" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi +export CUDA_VISIBLE_DEVICES=0 + +rm -f *.log +MODELSDIR=`pwd`/models +rm -fr $MODELSDIR && mkdir -p $MODELSDIR/add_sub/1 && \ + cp ../python_models/add_sub/config.pbtxt $MODELSDIR/add_sub && \ + cp ../python_models/add_sub/model.py $MODELSDIR/add_sub/1 && \ + +source ../common/util.sh + +RET=0 + +TEST_LOG="./backend_output_detail_test.log" +TEST_EXEC=./backend_output_detail_test + +set +e +LD_LIBRARY_PATH=/opt/tritonserver/lib:$LD_LIBRARY_PATH $TEST_EXEC >>$TEST_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Backend Output Detail Unit Test Failed\n***" + RET=1 +fi +set -e + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + cat $TEST_LOG + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_backend_python/argument_validation/models/argument_validation/1/model.py b/qa/L0_backend_python/argument_validation/models/argument_validation/1/model.py index 5af497aa0b..df1b298a35 100644 --- a/qa/L0_backend_python/argument_validation/models/argument_validation/1/model.py +++ b/qa/L0_backend_python/argument_validation/models/argument_validation/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,18 +24,18 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import unittest + +import numpy as np import triton_python_backend_utils as pb_utils class ArgumentValidationTest(unittest.TestCase): - def test_infer_request_args(self): # Dummy arguments used in the tests. 
- inputs = [pb_utils.Tensor('INPUT0', np.asarray([1, 2], dtype=np.int32))] - model_name = 'my_model' - requested_output_names = ['my_output'] + inputs = [pb_utils.Tensor("INPUT0", np.asarray([1, 2], dtype=np.int32))] + model_name = "my_model" + requested_output_names = ["my_output"] # # inputs field validation @@ -46,21 +46,24 @@ def test_infer_request_args(self): pb_utils.InferenceRequest( inputs=[None], model_name=model_name, - requested_output_names=requested_output_names) + requested_output_names=requested_output_names, + ) # Test None object as list of inputs with self.assertRaises(TypeError) as e: pb_utils.InferenceRequest( inputs=None, model_name=model_name, - requested_output_names=requested_output_names) + requested_output_names=requested_output_names, + ) # model_name validation with self.assertRaises(TypeError) as e: pb_utils.InferenceRequest( model_name=None, inputs=inputs, - requested_output_names=requested_output_names) + requested_output_names=requested_output_names, + ) # # Requested output name validations @@ -68,14 +71,14 @@ def test_infer_request_args(self): # Test list of None objects as requested_output_names with self.assertRaises(TypeError) as e: - pb_utils.InferenceRequest(requested_output_names=[None], - inputs=inputs, - model_name=model_name) + pb_utils.InferenceRequest( + requested_output_names=[None], inputs=inputs, model_name=model_name + ) with self.assertRaises(TypeError) as e: - pb_utils.InferenceRequest(requested_output_names=None, - inputs=inputs, - model_name=model_name) + pb_utils.InferenceRequest( + requested_output_names=None, inputs=inputs, model_name=model_name + ) # Other arguments validation @@ -85,7 +88,8 @@ def test_infer_request_args(self): requested_output_names=requested_output_names, inputs=inputs, model_name=model_name, - correleation_id=None) + correleation_id=None, + ) # request_id set to None with self.assertRaises(TypeError) as e: @@ -93,7 +97,8 @@ def test_infer_request_args(self): requested_output_names=requested_output_names, inputs=inputs, model_name=model_name, - request_id=None) + request_id=None, + ) # model_version set to None with self.assertRaises(TypeError) as e: @@ -101,7 +106,8 @@ def test_infer_request_args(self): requested_output_names=requested_output_names, inputs=inputs, model_name=model_name, - model_version=None) + model_version=None, + ) # flags set to None with self.assertRaises(TypeError) as e: @@ -109,17 +115,16 @@ def test_infer_request_args(self): requested_output_names=requested_output_names, inputs=inputs, model_name=model_name, - flags=None) + flags=None, + ) # Empty lists should not raise an exception - pb_utils.InferenceRequest(requested_output_names=[], - inputs=[], - model_name=model_name) + pb_utils.InferenceRequest( + requested_output_names=[], inputs=[], model_name=model_name + ) def test_infer_response_args(self): - outputs = [ - pb_utils.Tensor('OUTPUT0', np.asarray([1, 2], dtype=np.int32)) - ] + outputs = [pb_utils.Tensor("OUTPUT0", np.asarray([1, 2], dtype=np.int32))] # Test list of None object as output tensor with self.assertRaises(pb_utils.TritonModelException) as e: @@ -145,17 +150,47 @@ def test_tensor_args(self): pb_utils.Tensor("OUTPUT0", None) # Test None as dlpack capsule - with self.assertRaises(TypeError) as e: + with self.assertRaises(pb_utils.TritonModelException) as e: pb_utils.Tensor.from_dlpack("OUTPUT0", None) - # Test empty string as model name (from_dlpack) - with self.assertRaises(TypeError) as e: + # Test empty string as tensor name (from_dlpack) + with 
self.assertRaises(pb_utils.TritonModelException) as e: pb_utils.Tensor.from_dlpack("", None) - # Test empty string as model name + # Test empty string as tensor name with self.assertRaises(TypeError) as e: pb_utils.Tensor("", None) + def test_log_args(self): + logger = pb_utils.Logger + + # Test None as log level setting + with self.assertRaises(TypeError) as e: + logger.log("Invalid Level", None) + + # Test integer as log level setting + with self.assertRaises(TypeError) as e: + logger.log("Invalid Level", 1) + + # Test None as log info msg + with self.assertRaises(TypeError) as e: + logger.log_info(None) + + # Test None as log warning msg + with self.assertRaises(TypeError) as e: + logger.log_warn(None) + + # Test None as log error msg + with self.assertRaises(TypeError) as e: + logger.log_error(None) + + # Test None as log verbose msg + with self.assertRaises(TypeError) as e: + logger.log_verbose(None) + + # This should not raise an exception + logger.log("Level unspecified") + class TritonPythonModel: """This model tests the Python API arguments to make sure invalid args are @@ -165,12 +200,15 @@ def execute(self, requests): responses = [] for _ in requests: # Run the unittest and store the results in InferenceResponse. - test = unittest.main('model', exit=False) + test = unittest.main("model", exit=False) responses.append( - pb_utils.InferenceResponse([ - pb_utils.Tensor( - 'OUTPUT0', - np.array([test.result.wasSuccessful()], - dtype=np.float16)) - ])) + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) return responses diff --git a/qa/L0_backend_python/argument_validation/test.sh b/qa/L0_backend_python/argument_validation/test.sh old mode 100644 new mode 100755 index f80ce3e84b..b7f6e96293 --- a/qa/L0_backend_python/argument_validation/test.sh +++ b/qa/L0_backend_python/argument_validation/test.sh @@ -1,4 +1,5 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/bin/bash +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,14 +26,14 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. CLIENT_PY=../python_unittest.py -CLIENT_LOG="./client.log" +CLIENT_LOG="./arg_validation_client.log" EXPECTED_NUM_TESTS="1" TEST_RESULT_FILE='test_results.txt' TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./arg_validation_server.log" RET=0 source ../../common/util.sh diff --git a/qa/L0_backend_python/bls/bls_parameters_test.py b/qa/L0_backend_python/bls/bls_parameters_test.py new file mode 100755 index 0000000000..e08ab2b96f --- /dev/null +++ b/qa/L0_backend_python/bls/bls_parameters_test.py @@ -0,0 +1,71 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json +import unittest + +import numpy as np +import tritonclient.grpc as grpcclient +from tritonclient.utils import np_to_triton_dtype + + +class TestBlsParameters(unittest.TestCase): + def test_bls_parameters(self): + model_name = "bls_parameters" + shape = [1] + num_params = 3 + + # Based on the num_params specified, the model will generate a JSON response + # containing all the supported parameter types for num_params times recursively. + # Make sure the model has at least num_params + 1 instances. + expected_params = {} + for i in range(1, num_params + 1): + expected_params["bool_" + str(i)] = bool(i) + expected_params["int_" + str(i)] = i + expected_params["str_" + str(i)] = str(i) + + with grpcclient.InferenceServerClient("localhost:8001") as client: + input_data = np.array([num_params], dtype=np.ubyte) + inputs = [ + grpcclient.InferInput( + "NUMBER_PARAMETERS", shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + outputs = [grpcclient.InferRequestedOutput("PARAMETERS_AGGREGATED")] + result = client.infer(model_name, inputs, outputs=outputs) + params_json = str( + result.as_numpy("PARAMETERS_AGGREGATED")[0], encoding="utf-8" + ) + + params = json.loads(params_json) + self.assertEqual(params, expected_params) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_backend_python/bls/test.sh b/qa/L0_backend_python/bls/test.sh old mode 100644 new mode 100755 index 62a98dd228..95abc84e06 --- a/qa/L0_backend_python/bls/test.sh +++ b/qa/L0_backend_python/bls/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,7 +26,7 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
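[Editor's note on bls_parameters_test.py added above: purely as an illustration derived from the loop in that test (num_params == 3), the aggregate it asserts against works out to the following; no new names are introduced here.]

    expected_params = {
        "bool_1": True, "int_1": 1, "str_1": "1",
        "bool_2": True, "int_2": 2, "str_2": "2",
        "bool_3": True, "int_3": 3, "str_3": "3",
    }

[The bls_parameters model is expected to return this mapping JSON-encoded in the PARAMETERS_AGGREGATED output, which the test decodes with json.loads before comparing.]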
CLIENT_PY=../python_unittest.py -CLIENT_LOG="./client.log" +CLIENT_LOG="./bls_client.log" EXPECTED_NUM_TESTS="1" TEST_RESULT_FILE='test_results.txt' source ../../common/util.sh @@ -34,14 +34,14 @@ source ../../common/util.sh TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends -SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" RET=0 -rm -fr *.log ./models +# This variable is used to print out the correct server log for each sub-test. +SUB_TEST_RET=0 +rm -fr *.log ./models *.txt pip3 uninstall -y torch -pip3 install torch==1.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html +pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html mkdir -p models/bls/1/ cp ../../python_models/bls/model.py models/bls/1/ @@ -81,6 +81,178 @@ cp ../../python_models/dlpack_identity/config.pbtxt models/dlpack_identity cp -r ${DATADIR}/qa_sequence_implicit_model_repository/onnx_nobatch_sequence_int32/ ./models +git clone https://github.com/triton-inference-server/python_backend -b $PYTHON_BACKEND_REPO_TAG +mkdir -p models/square_int32/1/ +cp python_backend/examples/decoupled/square_model.py models/square_int32/1/model.py +cp python_backend/examples/decoupled/square_config.pbtxt models/square_int32/config.pbtxt + +mkdir -p models/dlpack_square/1/ +cp ../../python_models/dlpack_square/model.py models/dlpack_square/1/ +cp ../../python_models/dlpack_square/config.pbtxt models/dlpack_square + +mkdir -p models/identity_fp32_timeout/1/ +cp ../../python_models/identity_fp32_timeout/model.py models/identity_fp32_timeout/1/ +cp ../../python_models/identity_fp32_timeout/config.pbtxt models/identity_fp32_timeout + +cp -r ${DATADIR}/qa_model_repository/libtorch_nobatch_float32_float32_float32/ ./models/libtorch_gpu && \ + sed -i 's/libtorch_nobatch_float32_float32_float32/libtorch_gpu/' models/libtorch_gpu/config.pbtxt && \ + echo "instance_group [ { kind: KIND_GPU} ]" >> models/libtorch_gpu/config.pbtxt + +cp -r ${DATADIR}/qa_model_repository/libtorch_nobatch_float32_float32_float32/ ./models/libtorch_cpu && \ + sed -i 's/libtorch_nobatch_float32_float32_float32/libtorch_cpu/' models/libtorch_cpu/config.pbtxt && \ + echo "instance_group [ { kind: KIND_CPU} ]" >> models/libtorch_cpu/config.pbtxt + +# Test with different sizes of CUDA memory pool +for CUDA_MEMORY_POOL_SIZE_MB in 64 128 ; do + CUDA_MEMORY_POOL_SIZE_BYTES=$((CUDA_MEMORY_POOL_SIZE_MB * 1024 * 1024)) + SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1 --cuda-memory-pool-byte-size=0:${CUDA_MEMORY_POOL_SIZE_BYTES}" + for TRIAL in non_decoupled decoupled ; do + export BLS_KIND=$TRIAL + SERVER_LOG="./bls_$TRIAL.$CUDA_MEMORY_POOL_SIZE_MB.inference_server.log" + + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + set +e + + export MODEL_NAME='bls' + python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** 'bls' $BLS_KIND test FAILED. \n***" + cat $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + SUB_TEST_RET=1 + fi + fi + + export MODEL_NAME='bls_memory' + python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 + if [ $? 
-ne 0 ]; then + echo -e "\n***\n*** 'bls_memory' $BLS_KIND test FAILED. \n***" + cat $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + SUB_TEST_RET=1 + fi + fi + + export MODEL_NAME='bls_memory_async' + python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** 'bls_async_memory' $BLS_KIND test FAILED. \n***" + cat $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + SUB_TEST_RET=1 + fi + fi + + export MODEL_NAME='bls_async' + python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** 'bls_async' $BLS_KIND test FAILED. \n***" + cat $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + SUB_TEST_RET=1 + fi + fi + + set -e + + kill $SERVER_PID + wait $SERVER_PID + + if [ $SUB_TEST_RET -eq 1 ]; then + cat $CLIENT_LOG + cat $SERVER_LOG + fi + + # Check for bls 'test_timeout' to ensure timeout value is being correctly passed + if [ `grep -c "Request timeout: 11000000000" $SERVER_LOG` == "0" ]; then + echo -e "\n***\n*** BLS timeout value not correctly passed to model: line ${LINENO}\n***" + cat $SERVER_LOG + RET=1 + fi + + if [[ $CUDA_MEMORY_POOL_SIZE_MB -eq 128 ]]; then + if [ `grep -c "Failed to allocate memory from CUDA memory pool" $SERVER_LOG` != "0" ]; then + echo -e "\n***\n*** Expected to use CUDA memory pool for all tests when CUDA_MEMOY_POOL_SIZE_MB is 128 MB for 'bls' $BLS_KIND test\n***" + cat $SERVER_LOG + RET=1 + fi + fi + done +done + +# Test error handling when BLS is used in "initialize" or "finalize" function +ERROR_MESSAGE="BLS is only supported during the 'execute' function." + +rm -fr ./models +mkdir -p models/bls_init_error/1/ +cp ../../python_models/bls_init_error/model.py models/bls_init_error/1/ +cp ../../python_models/bls_init_error/config.pbtxt models/bls_init_error +SERVER_LOG="./bls_init_error_server.log" +SUB_TEST_RET=0 + +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "*** FAILED: unexpected success starting $SERVER" >> $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 + kill $SERVER_PID + wait $SERVER_PID +else + if grep "$ERROR_MESSAGE" $SERVER_LOG; then + echo -e "Found \"$ERROR_MESSAGE\"" >> $CLIENT_LOG + else + echo -e "Not found \"$ERROR_MESSAGE\"" >> $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 + fi +fi + +if [ $SUB_TEST_RET -eq 1 ]; then + cat $CLIENT_LOG + cat $SERVER_LOG +fi + +rm -fr ./models +mkdir -p models/bls_finalize_error/1/ +cp ../../python_models/bls_finalize_error/model.py models/bls_finalize_error/1/ +cp ../../python_models/bls_finalize_error/config.pbtxt models/bls_finalize_error/ +SERVER_LOG="./bls_finalize_error_server.log" +SUB_TEST_RET=0 + run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -88,66 +260,152 @@ if [ "$SERVER_PID" == "0" ]; then exit 1 fi -set +e +kill $SERVER_PID +wait $SERVER_PID -export MODEL_NAME='bls' -python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 -if [ $? -ne 0 ]; then - echo -e "\n***\n*** 'bls' test FAILED. 
\n***" +if grep "$ERROR_MESSAGE" $SERVER_LOG; then + echo -e "Found \"$ERROR_MESSAGE\"" >> $CLIENT_LOG +else + echo -e "Not found \"$ERROR_MESSAGE\"" >> $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 +fi + +if [ $SUB_TEST_RET -eq 1 ]; then cat $CLIENT_LOG + cat $SERVER_LOG +fi + +# Test model loading API with BLS +SUB_TEST_RET=0 +rm -fr ./models +mkdir -p models/bls_model_loading/1/ +cp ../../python_models/bls_model_loading/model.py models/bls_model_loading/1/ +cp ../../python_models/bls_model_loading/config.pbtxt models/bls_model_loading/ +cp -fr ${DATADIR}/qa_model_repository/onnx_int32_int32_int32 models/. +# Make only version 2, 3 is valid version directory +rm -rf models/onnx_int32_int32_int32/1 + +SERVER_LOG="./bls_model_loading_server.log" +SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --model-control-mode=explicit --log-verbose=1" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +export MODEL_NAME='bls_model_loading' + +set +e +code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${MODEL_NAME}/load` +set -e +if [ "$code" == "400" ]; then + echo -e "\n***\n*** Failed to load model '${MODEL_NAME}'\n***" RET=1 -else - check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS - if [ $? -ne 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Result Verification Failed\n***" - RET=1 - fi + SUB_TEST_RET=1 fi -export MODEL_NAME='bls_memory' -python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 +set +e + +python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then - echo -e "\n***\n*** 'bls_memory' test FAILED. \n***" + echo -e "\n***\n*** 'bls_model_loading' test FAILED. \n***" cat $CLIENT_LOG RET=1 + SUB_TEST_RET=1 else check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Result Verification Failed\n***" RET=1 + SUB_TEST_RET=1 fi fi -export MODEL_NAME='bls_memory_async' -python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 -if [ $? -ne 0 ]; then - echo -e "\n***\n*** 'bls_async_memory' test FAILED. 
\n***" +set -e + +kill $SERVER_PID +wait $SERVER_PID + +if [ $SUB_TEST_RET -eq 1 ]; then cat $CLIENT_LOG + cat $SERVER_LOG +fi + +# Test model loading API with BLS warmup +(cd models/bls_model_loading && \ + echo "model_warmup [{" >> config.pbtxt && \ + echo " name : \"regular sample\"" >> config.pbtxt && \ + echo " batch_size: 1" >> config.pbtxt && \ + echo " inputs {" >> config.pbtxt && \ + echo " key: \"INPUT0\"" >> config.pbtxt && \ + echo " value: {" >> config.pbtxt && \ + echo " data_type: TYPE_FP32" >> config.pbtxt && \ + echo " dims: 4" >> config.pbtxt && \ + echo " zero_data: false" >> config.pbtxt && \ + echo " }" >> config.pbtxt && \ + echo " }" >> config.pbtxt && \ + echo " inputs {" >> config.pbtxt && \ + echo " key: \"INPUT1\"" >> config.pbtxt && \ + echo " value: {" >> config.pbtxt && \ + echo " data_type: TYPE_FP32" >> config.pbtxt && \ + echo " dims: 4" >> config.pbtxt && \ + echo " zero_data: false" >> config.pbtxt && \ + echo " }" >> config.pbtxt && \ + echo " }" >> config.pbtxt && \ + echo "}]" >> config.pbtxt ) + +SUB_TEST_RET=0 +SERVER_LOG="./bls_model_loading_server_warmup.log" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${MODEL_NAME}/load` +set -e +if [ "$code" == "400" ]; then + echo -e "\n***\n*** Failed to load model '${MODEL_NAME}'\n***" RET=1 -else - check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS - if [ $? -ne 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Result Verification Failed\n***" - RET=1 - fi + SUB_TEST_RET=1 fi -export MODEL_NAME='bls_async' -python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 -if [ $? -ne 0 ]; then - echo -e "\n***\n*** 'bls_async' test FAILED. \n***" +kill $SERVER_PID +wait $SERVER_PID + +if [ $SUB_TEST_RET -eq 1 ]; then cat $CLIENT_LOG + cat $SERVER_LOG +fi + +# Test BLS parameters +rm -rf params_models && mkdir -p params_models/bls_parameters/1 +cp ../../python_models/bls_parameters/model.py ./params_models/bls_parameters/1 +cp ../../python_models/bls_parameters/config.pbtxt ./params_models/bls_parameters + +TEST_LOG="./bls_parameters.log" +SERVER_LOG="./bls_parameters.server.log" + +SERVER_ARGS="--model-repository=`pwd`/params_models --backend-directory=${BACKEND_DIR} --log-verbose=1" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python3 bls_parameters_test.py > $TEST_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** bls_parameters_test.py FAILED. \n***" + cat $TEST_LOG RET=1 -else - check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS - if [ $? -ne 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Result Verification Failed\n***" - RET=1 - fi fi set -e @@ -155,8 +413,6 @@ kill $SERVER_PID wait $SERVER_PID if [ $RET -eq 1 ]; then - cat $CLIENT_LOG - cat $SERVER_LOG echo -e "\n***\n*** BLS test FAILED. \n***" else echo -e "\n***\n*** BLS test PASSED. \n***" diff --git a/qa/L0_backend_python/common.sh b/qa/L0_backend_python/common.sh old mode 100644 new mode 100755 index 78d4998e2b..d66f99c75f --- a/qa/L0_backend_python/common.sh +++ b/qa/L0_backend_python/common.sh @@ -1,4 +1,5 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +#!/bin/bash +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -31,7 +32,7 @@ get_shm_pages() { install_conda() { rm -rf ./miniconda - file_name="Miniconda3-py38_4.9.2-Linux-x86_64.sh" + file_name="Miniconda3-py310_23.3.1-0-Linux-x86_64.sh" wget https://repo.anaconda.com/miniconda/$file_name # install miniconda in silent mode @@ -43,21 +44,30 @@ install_conda() { install_build_deps() { apt update && apt install software-properties-common rapidjson-dev -y - wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \ - gpg --dearmor - | \ - tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \ - apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' && \ - apt-get update && \ - apt-get install -y --no-install-recommends \ - cmake-data=3.18.4-0kitware1ubuntu20.04.1 cmake=3.18.4-0kitware1ubuntu20.04.1 + # Using CMAKE installation instruction from:: https://apt.kitware.com/ + apt update -q=2 \ + && apt install -y gpg wget \ + && wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null \ + && . /etc/os-release \ + && echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null \ + && apt-get update -q=2 \ + && apt-get install -y --no-install-recommends cmake=3.27.7* cmake-data=3.27.7* } create_conda_env() { - python_version=$1 - env_name=$2 + local python_version=$1 + local env_name=$2 conda create -n $env_name python=$python_version -y conda activate $env_name - conda install conda-pack -y + conda install -c conda-forge conda-pack -y +} + +create_conda_env_with_specified_path() { + local python_version=$1 + local env_path=$2 + conda create -p $env_path python=$python_version -y + conda activate $env_path + conda install -c conda-forge conda-pack -y } create_python_backend_stub() { diff --git a/qa/L0_backend_python/custom_metrics/test.sh b/qa/L0_backend_python/custom_metrics/test.sh new file mode 100755 index 0000000000..9ba098f493 --- /dev/null +++ b/qa/L0_backend_python/custom_metrics/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +CLIENT_PY=../python_unittest.py +CLIENT_LOG="./custom_metrics_client.log" +EXPECTED_NUM_TESTS="1" +TEST_RESULT_FILE='test_results.txt' +source ../../common/util.sh + +TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} +SERVER=${TRITON_DIR}/bin/tritonserver +BACKEND_DIR=${TRITON_DIR}/backends +SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" +SERVER_LOG="./custom_metrics_server.log" + +RET=0 +rm -fr *.log ./models *.txt + +mkdir -p models/custom_metrics/1/ +cp ../../python_models/custom_metrics/model.py models/custom_metrics/1/ +cp ../../python_models/custom_metrics/config.pbtxt models/custom_metrics + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e + +export MODEL_NAME='custom_metrics' +python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** 'Custom Metrics' test FAILED. \n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi + +set -e + +kill $SERVER_PID +wait $SERVER_PID + + +if [ $RET -eq 1 ]; then + cat $CLIENT_LOG + cat $SERVER_LOG + echo -e "\n***\n*** Custom Metrics test FAILED. \n***" +else + echo -e "\n***\n*** Custom Metrics test PASSED. \n***" +fi + +exit $RET diff --git a/qa/L0_backend_python/decoupled/decoupled_test.py b/qa/L0_backend_python/decoupled/decoupled_test.py old mode 100644 new mode 100755 index 715860f3b0..1fc862fd5c --- a/qa/L0_backend_python/decoupled/decoupled_test.py +++ b/qa/L0_backend_python/decoupled/decoupled_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,21 +27,21 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../../common") -import test_util as tu -import tritonclient +import queue import time -import tritonclient.grpc as grpcclient -from tritonclient.utils import * -import numpy as np import unittest from functools import partial -import queue +import numpy as np +import test_util as tu +import tritonclient.grpc as grpcclient +from tritonclient.utils import * -class UserData: +class UserData: def __init__(self): self._completed_requests = queue.Queue() @@ -52,10 +54,9 @@ def callback(user_data, result, error): class DecoupledTest(tu.TestResultCollector): - def test_decoupled_execute_error(self): # The decoupled_execute_error model returns an error for the first - # request and sucessfully processes the second request. This is making + # request and successfully processes the second request. 
This is making # sure that an error in a single request does not completely fail the # batch. @@ -63,8 +64,7 @@ def test_decoupled_execute_error(self): shape = [2, 2] number_of_requests = 2 user_data = UserData() - with grpcclient.InferenceServerClient( - "localhost:8001") as triton_client: + with grpcclient.InferenceServerClient("localhost:8001") as triton_client: triton_client.start_stream(callback=partial(callback, user_data)) input_datas = [] @@ -72,12 +72,12 @@ def test_decoupled_execute_error(self): input_data = np.random.randn(*shape).astype(np.float32) input_datas.append(input_data) inputs = [ - grpcclient.InferInput("IN", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + grpcclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) - triton_client.async_stream_infer(model_name=model_name, - inputs=inputs) + triton_client.async_stream_infer(model_name=model_name, inputs=inputs) for i in range(number_of_requests): result = user_data._completed_requests.get() @@ -91,27 +91,28 @@ def test_decoupled_execute_error(self): self.assertTrue( np.array_equal(output_data, input_datas[i]), "error: expected output {} to match input {}".format( - output_data, input_datas[i])) + output_data, input_datas[i] + ), + ) def test_decoupled_bls(self): # Test combinations of BLS and decoupled API in Python backend. model_name = "decoupled_bls" shape = [1, 2] user_data = UserData() - with grpcclient.InferenceServerClient( - "localhost:8001") as triton_client: + with grpcclient.InferenceServerClient("localhost:8001") as triton_client: triton_client.start_stream(callback=partial(callback, user_data)) input_datas = [] input_data = np.random.randn(*shape).astype(np.float32) input_datas.append(input_data) inputs = [ - grpcclient.InferInput("IN", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + grpcclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) - triton_client.async_stream_infer(model_name=model_name, - inputs=inputs) + triton_client.async_stream_infer(model_name=model_name, inputs=inputs) # Check the results of the decoupled model using BLS def check_result(result): @@ -123,11 +124,79 @@ def check_result(result): self.assertTrue( np.array_equal(output_data, input_data), "error: expected output {} to match input {}".format( - output_data, input_data)) + output_data, input_data + ), + ) result = user_data._completed_requests.get() check_result(result) + def test_decoupled_bls_stream(self): + # Test combinations of BLS and decoupled API in Python backend. + model_name = "decoupled_bls_stream" + in_values = [4, 2, 0, 1] + shape = [1] + user_data = UserData() + with grpcclient.InferenceServerClient("localhost:8001") as triton_client: + triton_client.start_stream(callback=partial(callback, user_data)) + for i in range(len(in_values)): + input_data = np.array([in_values[i]], dtype=np.int32) + inputs = [ + grpcclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + triton_client.async_stream_infer( + model_name=model_name, inputs=inputs, request_id=str(i) + ) + + # Retrieve results... 
+ recv_count = 0 + expected_count = sum(in_values) + result_dict = {} + while recv_count < expected_count: + data_item = user_data._completed_requests.get() + self.assertIsNot(type(data_item), InferenceServerException) + + this_id = data_item.get_response().id + if this_id not in result_dict.keys(): + result_dict[this_id] = [] + result_dict[this_id].append((recv_count, data_item)) + + recv_count += 1 + # Validate results... + for i in range(len(in_values)): + this_id = str(i) + is_received = False + if this_id in result_dict.keys(): + is_received = True + + if in_values[i] != 0: + self.assertTrue( + is_received, + "response for request id {} not received".format(this_id), + ) + self.assertEqual(len(result_dict[this_id]), in_values[i]) + + result_list = result_dict[this_id] + expected_data = np.array([in_values[i]], dtype=np.int32) + for j in range(len(result_list)): + this_data = result_list[j][1].as_numpy("OUT") + self.assertTrue( + np.array_equal(expected_data, this_data), + "error: incorrect data: expected {}, got {}".format( + expected_data, this_data + ), + ) + else: + self.assertFalse( + is_received, + "received unexpected response for request id {}".format( + this_id + ), + ) + def test_decoupled_return_response_error(self): model_name = "decoupled_return_response_error" shape = [16] @@ -137,10 +206,12 @@ def test_decoupled_return_response_error(self): input_data_0 = np.random.random(shape).astype(np.float32) input_data_1 = np.random.random(shape).astype(np.float32) inputs = [ - grpcclient.InferInput("INPUT0", input_data_0.shape, - np_to_triton_dtype(input_data_0.dtype)), - grpcclient.InferInput("INPUT1", input_data_1.shape, - np_to_triton_dtype(input_data_1.dtype)) + grpcclient.InferInput( + "INPUT0", input_data_0.shape, np_to_triton_dtype(input_data_0.dtype) + ), + grpcclient.InferInput( + "INPUT1", input_data_1.shape, np_to_triton_dtype(input_data_1.dtype) + ), ] inputs[0].set_data_from_numpy(input_data_0) inputs[1].set_data_from_numpy(input_data_1) @@ -149,9 +220,11 @@ def test_decoupled_return_response_error(self): if type(data_item) == InferenceServerException: self.assertEqual( data_item.message(), - "Python model 'decoupled_return_response_error_0' is using " + "Python model 'decoupled_return_response_error_0_0' is using " "the decoupled mode and the execute function must return " - "None.", "Exception message didn't match.") + "None.", + "Exception message didn't match.", + ) def test_decoupled_send_after_close_error(self): model_name = "decoupled_send_after_close_error" @@ -162,10 +235,12 @@ def test_decoupled_send_after_close_error(self): input_data_0 = np.random.random(shape).astype(np.float32) input_data_1 = np.random.random(shape).astype(np.float32) inputs = [ - grpcclient.InferInput("INPUT0", input_data_0.shape, - np_to_triton_dtype(input_data_0.dtype)), - grpcclient.InferInput("INPUT1", input_data_1.shape, - np_to_triton_dtype(input_data_1.dtype)) + grpcclient.InferInput( + "INPUT0", input_data_0.shape, np_to_triton_dtype(input_data_0.dtype) + ), + grpcclient.InferInput( + "INPUT1", input_data_1.shape, np_to_triton_dtype(input_data_1.dtype) + ), ] inputs[0].set_data_from_numpy(input_data_0) inputs[1].set_data_from_numpy(input_data_1) @@ -175,9 +250,75 @@ def test_decoupled_send_after_close_error(self): # way to deliver the error message to the client. The error # will be logged on the server side. 
time.sleep(4) - self.assertEqual(user_data._completed_requests.qsize(), 0, - "The completed request size must be zero.") + self.assertEqual( + user_data._completed_requests.qsize(), + 0, + "The completed request size must be zero.", + ) + + def test_decoupled_execute_cancel(self): + model_name = "execute_cancel" + log_path = "decoupled_server.log" + execute_delay = 4.0 # seconds + shape = [1, 1] + + user_data = UserData() + with grpcclient.InferenceServerClient("localhost:8001") as client: + client.start_stream(callback=partial(callback, user_data)) + input_data = np.array([[execute_delay]], dtype=np.float32) + inputs = [ + grpcclient.InferInput( + "EXECUTE_DELAY", shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + client.async_stream_infer(model_name, inputs) + time.sleep(2) # model delay for decoupling request and response sender + time.sleep(2) # ensure the request is executing + client.stop_stream(cancel_requests=True) + time.sleep(2) # ensure the cancellation is delivered + + self.assertFalse(user_data._completed_requests.empty()) + while not user_data._completed_requests.empty(): + data_item = user_data._completed_requests.get() + self.assertIsInstance(data_item, InferenceServerException) + self.assertEqual(data_item.status(), "StatusCode.CANCELLED") + + with open(log_path, mode="r", encoding="utf-8", errors="strict") as f: + log_text = f.read() + self.assertIn("[execute_cancel] Request not cancelled at 1.0 s", log_text) + self.assertIn("[execute_cancel] Request cancelled at ", log_text) + + def test_decoupled_raise_exception(self): + # The decoupled_raise_exception model raises an exception for the request. + # This test case is making sure that repeated exceptions are properly handled. + + model_name = "decoupled_raise_exception" + shape = [2, 2] + number_of_requests = 10 + user_data = UserData() + with grpcclient.InferenceServerClient("localhost:8001") as triton_client: + triton_client.start_stream(callback=partial(callback, user_data)) + + input_datas = [] + for i in range(number_of_requests): + input_data = np.random.randn(*shape).astype(np.float32) + input_datas.append(input_data) + inputs = [ + grpcclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + triton_client.async_stream_infer(model_name=model_name, inputs=inputs) + + for i in range(number_of_requests): + result = user_data._completed_requests.get() + self.assertIs(type(result), InferenceServerException) + self.assertIn("Intentional Error", result.message()) + + self.assertTrue(triton_client.is_model_ready(model_name)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/decoupled/models/decoupled_bls/1/model.py b/qa/L0_backend_python/decoupled/models/decoupled_bls/1/model.py index 56f79f99e6..782e7ec86e 100644 --- a/qa/L0_backend_python/decoupled/models/decoupled_bls/1/model.py +++ b/qa/L0_backend_python/decoupled/models/decoupled_bls/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,78 +24,102 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
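[Editor's note on the decoupled client tests in decoupled_test.py above: they all share one gRPC streaming pattern: register a callback that pushes results or errors onto a queue, send requests with async_stream_infer, then drain the queue. The sketch below is illustrative only; "my_decoupled_model" is a placeholder model name, the "IN" tensor mirrors the test models, and the callback body is paraphrased since the hunk shows only its signature.]

    import queue
    from functools import partial

    import numpy as np
    import tritonclient.grpc as grpcclient
    from tritonclient.utils import np_to_triton_dtype


    class UserData:
        def __init__(self):
            # Holds every streamed response (or error) delivered by the callback.
            self._completed_requests = queue.Queue()


    def callback(user_data, result, error):
        # Errors and results share one queue; callers inspect the item type later.
        user_data._completed_requests.put(error if error is not None else result)


    user_data = UserData()
    with grpcclient.InferenceServerClient("localhost:8001") as client:
        client.start_stream(callback=partial(callback, user_data))
        input_data = np.array([1], dtype=np.int32)
        inputs = [
            grpcclient.InferInput(
                "IN", input_data.shape, np_to_triton_dtype(input_data.dtype)
            )
        ]
        inputs[0].set_data_from_numpy(input_data)
        client.async_stream_infer(model_name="my_decoupled_model", inputs=inputs)
        # A decoupled model may answer with zero, one, or many responses; each one
        # (or an InferenceServerException) arrives on the queue via the callback.
        first_item = user_data._completed_requests.get()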
-import triton_python_backend_utils as pb_utils import json +import sys import threading import time + import numpy as np -import asyncio import torch +import triton_python_backend_utils as pb_utils from torch.utils.dlpack import from_dlpack, to_dlpack -import sys class TritonPythonModel: - """ This model sends an error message with the first request. - """ + """This model sends an error message with the first request.""" def initialize(self, args): + logger = pb_utils.Logger + logger.log("Initialize-Specific Msg!", logger.INFO) + logger.log_info("Initialize-Info Msg!") + logger.log_warn("Initialize-Warning Msg!") + logger.log_error("Initialize-Error Msg!") # You must parse model_config. JSON string is not parsed here - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) using_decoupled = pb_utils.using_decoupled_model_transaction_policy( - model_config) + model_config + ) if not using_decoupled: raise pb_utils.TritonModelException( """the model `{}` can generate any number of responses per request, enable decoupled transaction policy in model configuration to - serve this model""".format(args['model_name'])) + serve this model""".format( + args["model_name"] + ) + ) # Get OUT configuration out_config = pb_utils.get_output_config_by_name(model_config, "OUT") # Convert Triton types to numpy types - self.out_dtype = pb_utils.triton_string_to_numpy( - out_config['data_type']) + self.out_dtype = pb_utils.triton_string_to_numpy(out_config["data_type"]) self.inflight_thread_count = 0 self.inflight_thread_count_lck = threading.Lock() + logger = pb_utils.Logger + logger.log("Initialize-Specific Msg!", logger.INFO) + logger.log_info("Initialize-Info Msg!") + logger.log_warn("Initialize-Warning Msg!") + logger.log_error("Initialize-Error Msg!") def execute(self, requests): - """ This function is called on inference request. - """ - + """This function is called on inference request.""" + logger = pb_utils.Logger + logger.log("Execute-Specific Msg!", logger.INFO) + logger.log_info("Execute-Info Msg!") + logger.log_warn("Execute-Warning Msg!") + logger.log_error("Execute-Error Msg!") # Only generate the error for the first request for i, request in enumerate(requests): - request_input = pb_utils.get_input_tensor_by_name(request, 'IN') + request_input = pb_utils.get_input_tensor_by_name(request, "IN") # Sync BLS request infer_request = pb_utils.InferenceRequest( - model_name='identity_fp32', + model_name="identity_fp32", requested_output_names=["OUTPUT0"], - inputs=[pb_utils.Tensor('INPUT0', request_input.as_numpy())]) + inputs=[pb_utils.Tensor("INPUT0", request_input.as_numpy())], + ) infer_response = infer_request.exec() if infer_response.has_error(): raise pb_utils.TritonModelException( f"BLS Response has an error: {infer_response.error().message()}" ) - output0 = pb_utils.get_output_tensor_by_name( - infer_response, "OUTPUT0") + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") if np.any(output0.as_numpy() != request_input.as_numpy()): raise pb_utils.TritonModelException( f"BLS Request input and BLS response output do not match. 
{request_input.as_numpy()} != {output0.as_numpy()}" ) - thread1 = threading.Thread(target=self.response_thread, - args=(request.get_response_sender(), - pb_utils.get_input_tensor_by_name( - request, 'IN').as_numpy())) + thread1 = threading.Thread( + target=self.response_thread, + args=( + request.get_response_sender(), + pb_utils.get_input_tensor_by_name(request, "IN").as_numpy(), + ), + ) thread1.daemon = True with self.inflight_thread_count_lck: self.inflight_thread_count += 1 thread1.start() + logger = pb_utils.Logger + logger.log("Execute-Specific Msg!", logger.INFO) + logger.log_info("Execute-Info Msg!") + logger.log_warn("Execute-Warning Msg!") + logger.log_error("Execute-Error Msg!") + return None def _get_gpu_bls_outputs(self, input0_pb, input1_pb): @@ -105,16 +129,23 @@ def _get_gpu_bls_outputs(self, input0_pb, input1_pb): Returns True on success and False on failure. """ + logger = pb_utils.Logger + logger.log("_get_gpu_bls_outputs-Specific Msg!", logger.INFO) + logger.log_info("_get_gpu_bls_outputs-Info Msg!") + logger.log_warn("_get_gpu_bls_outputs-Warning Msg!") + logger.log_error("_get_gpu_bls_outputs-Error Msg!") + infer_request = pb_utils.InferenceRequest( - model_name='dlpack_add_sub', + model_name="dlpack_add_sub", inputs=[input0_pb, input1_pb], - requested_output_names=['OUTPUT0', 'OUTPUT1']) + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) infer_response = infer_request.exec() if infer_response.has_error(): return False - output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0') - output1 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT1') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1") if output0 is None or output1 is None: return False @@ -158,44 +189,56 @@ def _get_gpu_bls_outputs(self, input0_pb, input1_pb): return output0.to_dlpack(), output1.to_dlpack() def _test_gpu_bls_add_sub(self, is_input0_gpu, is_input1_gpu): + logger = pb_utils.Logger + logger.log("_test_gpu_bls_add_sub-Specific Msg!", logger.INFO) + logger.log_info("_test_gpu_bls_add_sub-Info Msg!") + logger.log_warn("_test_gpu_bls_add_sub-Warning Msg!") + logger.log_error("_test_gpu_bls_add_sub-Error Msg!") + input0 = torch.rand(16) input1 = torch.rand(16) if is_input0_gpu: - input0 = input0.to('cuda') + input0 = input0.to("cuda") if is_input1_gpu: - input1 = input1.to('cuda') + input1 = input1.to("cuda") - input0_pb = pb_utils.Tensor.from_dlpack('INPUT0', to_dlpack(input0)) - input1_pb = pb_utils.Tensor.from_dlpack('INPUT1', to_dlpack(input1)) + input0_pb = pb_utils.Tensor.from_dlpack("INPUT0", to_dlpack(input0)) + input1_pb = pb_utils.Tensor.from_dlpack("INPUT1", to_dlpack(input1)) gpu_bls_return = self._get_gpu_bls_outputs(input0_pb, input1_pb) if gpu_bls_return: output0_dlpack, output1_dlpack = gpu_bls_return else: return False - expected_output_0 = from_dlpack( - input0_pb.to_dlpack()).to('cpu') + from_dlpack( - input1_pb.to_dlpack()).to('cpu') - expected_output_1 = from_dlpack( - input0_pb.to_dlpack()).to('cpu') - from_dlpack( - input1_pb.to_dlpack()).to('cpu') + expected_output_0 = from_dlpack(input0_pb.to_dlpack()).to("cpu") + from_dlpack( + input1_pb.to_dlpack() + ).to("cpu") + expected_output_1 = from_dlpack(input0_pb.to_dlpack()).to("cpu") - from_dlpack( + input1_pb.to_dlpack() + ).to("cpu") output0_matches = torch.all( - expected_output_0 == from_dlpack(output0_dlpack).to('cpu')) + expected_output_0 == from_dlpack(output0_dlpack).to("cpu") + ) output1_matches = 
torch.all( - expected_output_1 == from_dlpack(output1_dlpack).to('cpu')) + expected_output_1 == from_dlpack(output1_dlpack).to("cpu") + ) if not output0_matches or not output1_matches: return False return True def execute_gpu_bls(self): + logger = pb_utils.Logger + logger.log("execute_gpu_bls-Specific Msg!", logger.INFO) + logger.log_info("execute_gpu_bls-Info Msg!") + logger.log_warn("execute_gpu_bls-Warning Msg!") + logger.log_error("execute_gpu_bls-Error Msg!") for input0_device in [True, False]: for input1_device in [True, False]: - test_status = self._test_gpu_bls_add_sub( - input0_device, input1_device) + test_status = self._test_gpu_bls_add_sub(input0_device, input1_device) if not test_status: return False @@ -205,59 +248,69 @@ def response_thread(self, response_sender, in_input): # The response_sender is used to send response(s) associated with the # corresponding request. # Sleep 5 seconds to make sure the main thread has exited. + logger = pb_utils.Logger + logger.log("response_thread-Specific Msg!", logger.INFO) + logger.log_info("response_thread-Info Msg!") + logger.log_warn("response_thread-Warning Msg!") + logger.log_error("response_thread-Error Msg!") time.sleep(5) status = self.execute_gpu_bls() if not status: - infer_response = pb_utils.InferenceResponse( - error="GPU BLS test failed.") + infer_response = pb_utils.InferenceResponse(error="GPU BLS test failed.") response_sender.send(infer_response) else: in_value = in_input infer_request = pb_utils.InferenceRequest( - model_name='identity_fp32', + model_name="identity_fp32", requested_output_names=["OUTPUT0"], - inputs=[pb_utils.Tensor('INPUT0', in_input)]) + inputs=[pb_utils.Tensor("INPUT0", in_input)], + ) infer_response = infer_request.exec() - output0 = pb_utils.get_output_tensor_by_name( - infer_response, "OUTPUT0") + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") if infer_response.has_error(): response = pb_utils.InferenceResponse( - error=infer_response.error().message()) + error=infer_response.error().message() + ) response_sender.send( - response, - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) elif np.any(in_input != output0.as_numpy()): error_message = ( "BLS Request input and BLS response output do not match." - f" {in_value} != {output0.as_numpy()}") + f" {in_value} != {output0.as_numpy()}" + ) response = pb_utils.InferenceResponse(error=error_message) response_sender.send( - response, - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) else: - output_tensors = [pb_utils.Tensor('OUT', in_value)] - response = pb_utils.InferenceResponse( - output_tensors=output_tensors) + output_tensors = [pb_utils.Tensor("OUT", in_value)] + response = pb_utils.InferenceResponse(output_tensors=output_tensors) response_sender.send( - response, - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) with self.inflight_thread_count_lck: self.inflight_thread_count -= 1 + logger.log("response_thread-Specific Msg!", logger.INFO) + logger.log_info("response_thread-Info Msg!") + logger.log_warn("response_thread-Warning Msg!") + logger.log_error("response_thread-Error Msg!") def finalize(self): """`finalize` is called only once when the model is being unloaded. Implementing `finalize` function is OPTIONAL. This function allows the model to perform any necessary clean ups before exit. 
""" - print('Finalize invoked') + logger = pb_utils.Logger + logger.log_info("Finalize invoked") inflight_threads = True while inflight_threads: with self.inflight_thread_count_lck: - inflight_threads = (self.inflight_thread_count != 0) + inflight_threads = self.inflight_thread_count != 0 if inflight_threads: time.sleep(0.1) - print('Finalize complete...') + logger.log_info("Finalize complete...") diff --git a/qa/L0_backend_python/decoupled/models/decoupled_bls_stream/1/model.py b/qa/L0_backend_python/decoupled/models/decoupled_bls_stream/1/model.py new file mode 100644 index 0000000000..8643482912 --- /dev/null +++ b/qa/L0_backend_python/decoupled/models/decoupled_bls_stream/1/model.py @@ -0,0 +1,132 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json +import threading +import time + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + """This model sends a BLS request to a decoupled model 'square_int32' and + returns the output from 'square_int32' as responses. + """ + + def initialize(self, args): + # You must parse model_config. 
JSON string is not parsed here + self.model_config = model_config = json.loads(args["model_config"]) + + using_decoupled = pb_utils.using_decoupled_model_transaction_policy( + model_config + ) + if not using_decoupled: + raise pb_utils.TritonModelException( + """the model `{}` can generate any number of responses per request, + enable decoupled transaction policy in model configuration to + serve this model""".format( + args["model_name"] + ) + ) + + self.inflight_thread_count = 0 + self.inflight_thread_count_lck = threading.Lock() + + def execute(self, requests): + """This function is called on inference request.""" + + for request in requests: + thread = threading.Thread( + target=self.response_thread, + args=( + request.get_response_sender(), + pb_utils.get_input_tensor_by_name(request, "IN").as_numpy(), + ), + ) + thread.daemon = True + with self.inflight_thread_count_lck: + self.inflight_thread_count += 1 + thread.start() + + return None + + def response_thread(self, response_sender, in_value): + infer_request = pb_utils.InferenceRequest( + model_name="square_int32", + requested_output_names=["OUT"], + inputs=[pb_utils.Tensor("IN", in_value)], + ) + infer_responses = infer_request.exec(decoupled=True) + + response_count = 0 + for infer_response in infer_responses: + if len(infer_response.output_tensors()) > 0: + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + if infer_response.has_error(): + response = pb_utils.InferenceResponse( + error=infer_response.error().message() + ) + response_sender.send( + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) + elif np.any(in_value != output0.as_numpy()): + error_message = ( + "BLS Request input and BLS response output do not match." + f" {in_value} != {output0.as_numpy()}" + ) + response = pb_utils.InferenceResponse(error=error_message) + response_sender.send( + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) + else: + output_tensors = [pb_utils.Tensor("OUT", output0.as_numpy())] + response = pb_utils.InferenceResponse(output_tensors=output_tensors) + response_sender.send(response) + + response_count += 1 + + if in_value != response_count - 1: + error_message = "Expected {} responses, got {}".format( + in_value, len(infer_responses) - 1 + ) + response = pb_utils.InferenceResponse(error=error_message) + response_sender.send( + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) + else: + response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + + with self.inflight_thread_count_lck: + self.inflight_thread_count -= 1 + + def finalize(self): + inflight_threads = True + while inflight_threads: + with self.inflight_thread_count_lck: + inflight_threads = self.inflight_thread_count != 0 + if inflight_threads: + time.sleep(0.1) diff --git a/qa/L0_backend_python/decoupled/models/decoupled_bls_stream/config.pbtxt b/qa/L0_backend_python/decoupled/models/decoupled_bls_stream/config.pbtxt new file mode 100644 index 0000000000..23ad453212 --- /dev/null +++ b/qa/L0_backend_python/decoupled/models/decoupled_bls_stream/config.pbtxt @@ -0,0 +1,54 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "decoupled_bls_stream" +backend: "python" + +model_transaction_policy { + decoupled: True +} +input [ + { + name: "IN" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] + +output [ + { + name: "OUT" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] diff --git a/qa/L0_backend_python/decoupled/models/decoupled_execute_error/1/model.py b/qa/L0_backend_python/decoupled/models/decoupled_execute_error/1/model.py index 1a7bd7abed..3882f0da9c 100644 --- a/qa/L0_backend_python/decoupled/models/decoupled_execute_error/1/model.py +++ b/qa/L0_backend_python/decoupled/models/decoupled_execute_error/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,49 +24,55 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils import json import threading import time +import triton_python_backend_utils as pb_utils + class TritonPythonModel: - """ This model sends an error message with the first request. - """ + """This model sends an error message with the first request.""" def initialize(self, args): # You must parse model_config. 
JSON string is not parsed here - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) using_decoupled = pb_utils.using_decoupled_model_transaction_policy( - model_config) + model_config + ) if not using_decoupled: raise pb_utils.TritonModelException( """the model `{}` can generate any number of responses per request, enable decoupled transaction policy in model configuration to - serve this model""".format(args['model_name'])) + serve this model""".format( + args["model_name"] + ) + ) # Get OUT configuration out_config = pb_utils.get_output_config_by_name(model_config, "OUT") # Convert Triton types to numpy types - self.out_dtype = pb_utils.triton_string_to_numpy( - out_config['data_type']) + self.out_dtype = pb_utils.triton_string_to_numpy(out_config["data_type"]) self.inflight_thread_count = 0 self.inflight_thread_count_lck = threading.Lock() def execute(self, requests): - """ This function is called on inference request. - """ + """This function is called on inference request.""" # Only generate the error for the first request for i, request in enumerate(requests): # Start a separate thread to send the responses for the request. - thread = threading.Thread(target=self.response_thread, - args=(request.get_response_sender(), i, - pb_utils.get_input_tensor_by_name( - request, 'IN').as_numpy())) + thread = threading.Thread( + target=self.response_thread, + args=( + request.get_response_sender(), + i, + pb_utils.get_input_tensor_by_name(request, "IN").as_numpy(), + ), + ) thread.daemon = True with self.inflight_thread_count_lck: @@ -86,9 +92,10 @@ def response_thread(self, response_sender, index, in_input): out_output = pb_utils.Tensor("OUT", in_value) if index == 0: - error = pb_utils.TritonError('An error occured during execution') - response = pb_utils.InferenceResponse(output_tensors=[out_output], - error=error) + error = pb_utils.TritonError("An error occurred during execution") + response = pb_utils.InferenceResponse( + output_tensors=[out_output], error=error + ) else: response = pb_utils.InferenceResponse(output_tensors=[out_output]) response_sender.send(response) @@ -96,8 +103,7 @@ def response_thread(self, response_sender, index, in_input): # We must close the response sender to indicate to Triton that we are # done sending responses for the corresponding request. We can't use the # response sender after closing it. - response_sender.send( - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) with self.inflight_thread_count_lck: self.inflight_thread_count -= 1 @@ -107,13 +113,13 @@ def finalize(self): Implementing `finalize` function is OPTIONAL. This function allows the model to perform any necessary clean ups before exit. 
""" - print('Finalize invoked') + print("Finalize invoked") inflight_threads = True while inflight_threads: with self.inflight_thread_count_lck: - inflight_threads = (self.inflight_thread_count != 0) + inflight_threads = self.inflight_thread_count != 0 if inflight_threads: time.sleep(0.1) - print('Finalize complete...') + print("Finalize complete...") diff --git a/qa/L0_backend_python/decoupled/models/decoupled_raise_exception/1/model.py b/qa/L0_backend_python/decoupled/models/decoupled_raise_exception/1/model.py new file mode 100644 index 0000000000..03a19db98d --- /dev/null +++ b/qa/L0_backend_python/decoupled/models/decoupled_raise_exception/1/model.py @@ -0,0 +1,35 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +class TritonPythonModel: + def initialize(self, args): + pass + + def execute(self, requests): + for request in requests: + raise Exception("Intentional Error") + return None diff --git a/qa/L0_backend_python/decoupled/models/decoupled_raise_exception/config.pbtxt b/qa/L0_backend_python/decoupled/models/decoupled_raise_exception/config.pbtxt new file mode 100644 index 0000000000..046687dfe7 --- /dev/null +++ b/qa/L0_backend_python/decoupled/models/decoupled_raise_exception/config.pbtxt @@ -0,0 +1,55 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "decoupled_raise_exception" +backend: "python" +max_batch_size: 64 + +model_transaction_policy { + decoupled: True +} +input [ + { + name: "IN" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +output [ + { + name: "OUT" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] diff --git a/qa/L0_backend_python/decoupled/models/decoupled_return_response_error/1/model.py b/qa/L0_backend_python/decoupled/models/decoupled_return_response_error/1/model.py index 959fb0fcae..ecde9c7168 100644 --- a/qa/L0_backend_python/decoupled/models/decoupled_return_response_error/1/model.py +++ b/qa/L0_backend_python/decoupled/models/decoupled_return_response_error/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,39 +24,43 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import json + import triton_python_backend_utils as pb_utils class TritonPythonModel: - """ This model tries to return a response directly from + """This model tries to return a response directly from execute function when configured as decoupled model. 
""" def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) using_decoupled = pb_utils.using_decoupled_model_transaction_policy( - model_config) + model_config + ) if not using_decoupled: raise pb_utils.TritonModelException( """the model `{}` can generate any number of responses per request, - enable decoupled transaction policy in model configuration to - serve this model""".format(args['model_name'])) + enable decoupled transaction policy in model configuration to + serve this model""".format( + args["model_name"] + ) + ) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) def execute(self, requests): - """ Tries to create a response sender object and use that + """Tries to create a response sender object and use that for sending the response. """ @@ -67,13 +71,12 @@ def execute(self, requests): for request in requests: in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") - out_0, out_1 = (in_0.as_numpy() + in_1.as_numpy(), - in_0.as_numpy() - in_1.as_numpy()) + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) - responses.append( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) + responses.append(pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) return responses diff --git a/qa/L0_backend_python/decoupled/models/decoupled_send_after_close_error/1/model.py b/qa/L0_backend_python/decoupled/models/decoupled_send_after_close_error/1/model.py index 296269bb27..52aa17ac0d 100644 --- a/qa/L0_backend_python/decoupled/models/decoupled_send_after_close_error/1/model.py +++ b/qa/L0_backend_python/decoupled/models/decoupled_send_after_close_error/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,46 +24,51 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import json + import triton_python_backend_utils as pb_utils class TritonPythonModel: - """ This model tries to send response after closing + """This model tries to send response after closing the response_sender. 
""" def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) using_decoupled = pb_utils.using_decoupled_model_transaction_policy( - model_config) + model_config + ) if not using_decoupled: raise pb_utils.TritonModelException( """the model `{}` can generate any number of responses per request, - enable decoupled transaction policy in model configuration to - serve this model""".format(args['model_name'])) + enable decoupled transaction policy in model configuration to + serve this model""".format( + args["model_name"] + ) + ) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) def execute(self, requests): - """ Create a response sender object and use that + """Create a response sender object and use that for sending the response. """ # This model does not support batching, so 'request_count' should always be 1. if len(requests) != 1: - raise pb_utils.TritonModelException("unsupported batch size " + - len(requests)) + raise pb_utils.TritonModelException( + "unsupported batch size " + len(requests) + ) output0_dtype = self.output0_dtype output1_dtype = self.output1_dtype @@ -71,13 +76,14 @@ def execute(self, requests): response_sender = requests[0].get_response_sender() in_0 = pb_utils.get_input_tensor_by_name(requests[0], "INPUT0") in_1 = pb_utils.get_input_tensor_by_name(requests[0], "INPUT1") - out_0, out_1 = (in_0.as_numpy() + in_1.as_numpy(), - in_0.as_numpy() - in_1.as_numpy()) + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) response = pb_utils.InferenceResponse([out_tensor_0, out_tensor_1]) - response_sender.send( - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) response_sender.send(response) diff --git a/qa/L0_backend_python/decoupled/test.sh b/qa/L0_backend_python/decoupled/test.sh old mode 100644 new mode 100755 index 5c73af6c4a..db8d4625f1 --- a/qa/L0_backend_python/decoupled/test.sh +++ b/qa/L0_backend_python/decoupled/test.sh @@ -1,4 +1,5 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/bin/bash +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,14 +26,17 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
CLIENT_PY=./decoupled_test.py -CLIENT_LOG="./client.log" -EXPECTED_NUM_TESTS="4" +CLIENT_LOG="./decoupled_client.log" +EXPECTED_NUM_TESTS="7" TEST_RESULT_FILE='test_results.txt' TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./decoupled_server.log" + +pip3 uninstall -y torch +pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html RET=0 source ../../common/util.sh @@ -46,6 +50,43 @@ mkdir -p models/dlpack_add_sub/1/ cp ../../python_models/dlpack_add_sub/model.py models/dlpack_add_sub/1/ cp ../../python_models/dlpack_add_sub/config.pbtxt models/dlpack_add_sub/ +mkdir -p models/execute_cancel/1/ +cp ../../python_models/execute_cancel/model.py ./models/execute_cancel/1/ +cp ../../python_models/execute_cancel/config.pbtxt ./models/execute_cancel/ +echo "model_transaction_policy { decoupled: True }" >> ./models/execute_cancel/config.pbtxt + +git clone https://github.com/triton-inference-server/python_backend -b $PYTHON_BACKEND_REPO_TAG +mkdir -p models/square_int32/1/ +cp python_backend/examples/decoupled/square_model.py models/square_int32/1/model.py +cp python_backend/examples/decoupled/square_config.pbtxt models/square_int32/config.pbtxt + +function verify_log_counts () { + if [ `grep -c "Specific Msg!" $SERVER_LOG` -lt 1 ]; then + echo -e "\n***\n*** Test Failed: Specific Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Info Msg!" $SERVER_LOG` -lt 1 ]; then + echo -e "\n***\n*** Test Failed: Info Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Warning Msg!" $SERVER_LOG` -lt 1 ]; then + echo -e "\n***\n*** Test Failed: Warning Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Error Msg!" $SERVER_LOG` -lt 1 ]; then + echo -e "\n***\n*** Test Failed: Error Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Finalize invoked" $SERVER_LOG` -ne 3 ]; then + echo -e "\n***\n*** Test Failed: 'Finalize invoked' message missing\n***" + RET=1 + fi + if [ `grep -c "Finalize complete..." $SERVER_LOG` -ne 3 ]; then + echo -e "\n***\n*** Test Failed: 'Finalize complete...' message missing\n***" + RET=1 + fi +} + run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -72,6 +113,8 @@ set -e kill $SERVER_PID wait $SERVER_PID +verify_log_counts + if [ $RET -eq 1 ]; then cat $CLIENT_LOG cat $SERVER_LOG diff --git a/qa/L0_backend_python/ensemble/ensemble_test.py b/qa/L0_backend_python/ensemble/ensemble_test.py old mode 100644 new mode 100755 index 831f1fa5a3..9fb60e5a4e --- a/qa/L0_backend_python/ensemble/ensemble_test.py +++ b/qa/L0_backend_python/ensemble/ensemble_test.py @@ -1,4 +1,6 @@ -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,23 +27,23 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../../common") -import test_util as tu +import unittest + +import numpy as np import shm_util +import test_util as tu import tritonclient.http as httpclient from tritonclient.utils import * -import numpy as np -import unittest class EnsembleTest(tu.TestResultCollector): - def setUp(self): self._shm_leak_detector = shm_util.ShmLeakDetector() - def test_ensemble(self): - model_name = "ensemble" + def infer(self, model_name): shape = [16] with self._shm_leak_detector.Probe() as shm_probe: with httpclient.InferenceServerClient("localhost:8000") as client: @@ -49,47 +51,37 @@ def test_ensemble(self): input_data_1 = np.random.random(shape).astype(np.float32) inputs = [ httpclient.InferInput( - "INPUT0", input_data_0.shape, - np_to_triton_dtype(input_data_0.dtype)), + "INPUT0", + input_data_0.shape, + np_to_triton_dtype(input_data_0.dtype), + ), httpclient.InferInput( - "INPUT1", input_data_1.shape, - np_to_triton_dtype(input_data_1.dtype)) + "INPUT1", + input_data_1.shape, + np_to_triton_dtype(input_data_1.dtype), + ), ] inputs[0].set_data_from_numpy(input_data_0) inputs[1].set_data_from_numpy(input_data_1) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') - output1 = result.as_numpy('OUTPUT1') + output0 = result.as_numpy("OUTPUT0") + output1 = result.as_numpy("OUTPUT1") self.assertIsNotNone(output0) self.assertIsNotNone(output1) - self.assertTrue(np.allclose(output0, 2 * input_data_0)) - self.assertTrue(np.allclose(output1, 2 * input_data_1)) + # Set a big enough tolerance to reduce intermittence. May be + # better to test integer outputs in the future for consistency. + self.assertTrue(np.allclose(output0, 2 * input_data_0, atol=1e-06)) + self.assertTrue(np.allclose(output1, 2 * input_data_1, atol=1e-06)) - model_name = "ensemble_gpu" - with self._shm_leak_detector.Probe() as shm_probe: - with httpclient.InferenceServerClient("localhost:8000") as client: - input_data_0 = np.random.random(shape).astype(np.float32) - input_data_1 = np.random.random(shape).astype(np.float32) - inputs = [ - httpclient.InferInput( - "INPUT0", input_data_0.shape, - np_to_triton_dtype(input_data_0.dtype)), - httpclient.InferInput( - "INPUT1", input_data_1.shape, - np_to_triton_dtype(input_data_1.dtype)) - ] - inputs[0].set_data_from_numpy(input_data_0) - inputs[1].set_data_from_numpy(input_data_1) - result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') - output1 = result.as_numpy('OUTPUT1') - self.assertIsNotNone(output0) - self.assertIsNotNone(output1) + def test_ensemble(self): + model_name = "ensemble" + self.infer(model_name) - self.assertTrue(np.allclose(output0, 2 * input_data_0)) - self.assertTrue(np.allclose(output1, 2 * input_data_1)) + def test_ensemble_gpu(self): + model_name = "ensemble_gpu" + self.infer(model_name) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/ensemble/test.sh b/qa/L0_backend_python/ensemble/test.sh old mode 100644 new mode 100755 index cd1018733b..c9292c4f4a --- a/qa/L0_backend_python/ensemble/test.sh +++ b/qa/L0_backend_python/ensemble/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,9 +25,8 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -CLIENT_PY=./lifecycle_test.py -CLIENT_LOG="./client.log" -EXPECTED_NUM_TESTS="1" +CLIENT_LOG="./ensemble_client.log" +EXPECTED_NUM_TESTS="2" TEST_RESULT_FILE='test_results.txt' source ../common.sh source ../../common/util.sh @@ -36,7 +35,7 @@ TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./ensemble_server.log" RET=0 rm -rf models/ $CLIENT_LOG @@ -47,14 +46,10 @@ cp ../../python_models/ensemble/config.pbtxt ./models/ensemble mkdir -p models/add_sub_1/1/ cp ../../python_models/add_sub/config.pbtxt ./models/add_sub_1 -(cd models/add_sub_1 && \ - sed -i "s/^name:.*/name: \"add_sub_1\"/" config.pbtxt) cp ../../python_models/add_sub/model.py ./models/add_sub_1/1/ mkdir -p models/add_sub_2/1/ cp ../../python_models/add_sub/config.pbtxt ./models/add_sub_2/ -(cd models/add_sub_2 && \ - sed -i "s/^name:.*/name: \"add_sub_2\"/" config.pbtxt) cp ../../python_models/add_sub/model.py ./models/add_sub_2/1/ # Ensemble GPU Model diff --git a/qa/L0_backend_python/env/test.sh b/qa/L0_backend_python/env/test.sh old mode 100644 new mode 100755 index 361635a9c4..e1106f8e79 --- a/qa/L0_backend_python/env/test.sh +++ b/qa/L0_backend_python/env/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,15 +25,15 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -CLIENT_LOG="./client.log" +CLIENT_LOG="./env_client.log" source ../common.sh source ../../common/util.sh SERVER=/opt/tritonserver/bin/tritonserver -BASE_SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1 --strict-model-config=false" +BASE_SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1 --disable-auto-complete-config" PYTHON_BACKEND_BRANCH=$PYTHON_BACKEND_REPO_TAG SERVER_ARGS=$BASE_SERVER_ARGS -SERVER_LOG="./inference_server.log" +SERVER_LOG="./env_server.log" RET=0 @@ -48,6 +48,9 @@ install_conda create_conda_env "3.7" "python-3-7" conda install numpy=1.20.1 -y conda install tensorflow=2.1.0 -y +conda install -c conda-forge libstdcxx-ng=12 -y + +PY37_VERSION_STRING="Python version is 3.7, NumPy version is 1.20.1, and Tensorflow version is 2.1.0" create_python_backend_stub conda-pack -o python3.7.tar.gz path_to_conda_pack=`pwd`/python3.7.tar.gz @@ -60,12 +63,38 @@ cp ../../python_models/python_version/model.py ./models/python_3_7/1/ cp python_backend/builddir/triton_python_backend_stub ./models/python_3_7 conda deactivate +# Use python-3-7 without conda pack +# Create a model with python 3.7 version and numpy 1.20.3 to distinguish from +# previous test. +# Tensorflow 2.1.0 only works with Python 3.4 - 3.7. Successful execution of +# the Python model indicates that the environment has been setup correctly. 
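+# Note: unlike the conda-pack model above, EXECUTION_ENV_PATH for this model
+# points at the unpacked environment directory itself rather than a packed
+# archive, so no tarball is built and the conda activate script is copied
+# into the environment's bin directory manually below.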
+path_to_conda_pack="$PWD/python-3-7-1" +create_conda_env_with_specified_path "3.7" $path_to_conda_pack +conda install numpy=1.20.3 -y +conda install tensorflow=2.1.0 -y +conda install -c conda-forge libstdcxx-ng=12 -y + +PY37_1_VERSION_STRING="Python version is 3.7, NumPy version is 1.20.3, and Tensorflow version is 2.1.0" +create_python_backend_stub +mkdir -p models/python_3_7_1/1/ +cp ../../python_models/python_version/config.pbtxt ./models/python_3_7_1 +(cd models/python_3_7_1 && \ + sed -i "s/^name:.*/name: \"python_3_7_1\"/" config.pbtxt && \ + echo "parameters: {key: \"EXECUTION_ENV_PATH\", value: {string_value: \"$path_to_conda_pack\"}}">> config.pbtxt) +cp ../../python_models/python_version/model.py ./models/python_3_7_1/1/ +# Copy activate script to folder +cp $path_to_conda_pack/lib/python3.7/site-packages/conda_pack/scripts/posix/activate $path_to_conda_pack/bin/. +cp python_backend/builddir/triton_python_backend_stub ./models/python_3_7_1 +conda deactivate + # Create a model with python 3.6 version # Tensorflow 2.1.0 only works with Python 3.4 - 3.7. Successful execution of # the Python model indicates that the environment has been setup correctly. create_conda_env "3.6" "python-3-6" +conda install -c conda-forge libstdcxx-ng=12 -y conda install numpy=1.18.1 -y conda install tensorflow=2.1.0 -y +PY36_VERSION_STRING="Python version is 3.6, NumPy version is 1.18.1, and Tensorflow version is 2.1.0" conda-pack -o python3.6.tar.gz # Test relative execution env path @@ -79,21 +108,26 @@ cp python3.6.tar.gz models/python_3_6/python_3_6_environment.tar.gz echo "parameters: {key: \"EXECUTION_ENV_PATH\", value: {string_value: \"$path_to_conda_pack\"}}" >> config.pbtxt) cp ../../python_models/python_version/model.py ./models/python_3_6/1/ cp python_backend/builddir/triton_python_backend_stub ./models/python_3_6 +conda deactivate -# Test conda env without custom Python backend stub -# Tensorflow 2.3.0 only works with Python 3.5 - 3.8. -path_to_conda_pack='$$TRITON_MODEL_DIRECTORY/python_3_8_environment.tar.gz' -create_conda_env "3.8" "python-3-8" -conda install numpy=1.19.1 -y -conda install tensorflow=2.3.0 -y -conda-pack -o python3.8.tar.gz -mkdir -p models/python_3_8/1/ -cp ../../python_models/python_version/config.pbtxt ./models/python_3_8 -cp python3.8.tar.gz models/python_3_8/python_3_8_environment.tar.gz -(cd models/python_3_8 && \ - sed -i "s/^name:.*/name: \"python_3_8\"/" config.pbtxt && \ +# Test conda env without custom Python backend stub This environment should +# always use the default Python version shipped in the container. 
For Ubuntu 22.04 +# it is Python 3.10 and for Ubuntu 20.04 is 3.8 +path_to_conda_pack='$$TRITON_MODEL_DIRECTORY/python_3_10_environment.tar.gz' +create_conda_env "3.10" "python-3-10" +conda install -c conda-forge libstdcxx-ng=12 -y +conda install numpy=1.23.4 -y +conda install tensorflow=2.10.0 -y +PY310_VERSION_STRING="Python version is 3.10, NumPy version is 1.23.4, and Tensorflow version is 2.10.0" +conda pack -o python3.10.tar.gz +mkdir -p models/python_3_10/1/ +cp ../../python_models/python_version/config.pbtxt ./models/python_3_10 +cp python3.10.tar.gz models/python_3_10/python_3_10_environment.tar.gz +(cd models/python_3_10 && \ + sed -i "s/^name:.*/name: \"python_3_10\"/" config.pbtxt && \ echo "parameters: {key: \"EXECUTION_ENV_PATH\", value: {string_value: \"$path_to_conda_pack\"}}" >> config.pbtxt) -cp ../../python_models/python_version/model.py ./models/python_3_8/1/ +cp ../../python_models/python_version/model.py ./models/python_3_10/1/ +conda deactivate rm -rf ./miniconda run_server @@ -107,31 +141,81 @@ kill $SERVER_PID wait $SERVER_PID set +e -grep "Python version is 3.6, NumPy version is 1.18.1, and Tensorflow version is 2.1.0" $SERVER_LOG -if [ $? -ne 0 ]; then - cat $SERVER_LOG - echo -e "\n***\n*** Python version is 3.6, NumPy version is 1.18.1, and Tensorflow version is 2.1.0 was not found in Triton logs. \n***" - RET=1 -fi +for EXPECTED_VERSION_STRING in "$PY36_VERSION_STRING" "$PY37_VERSION_STRING" "$PY37_1_VERSION_STRING" "$PY310_VERSION_STRING"; do + grep "$EXPECTED_VERSION_STRING" $SERVER_LOG + if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** $EXPECTED_VERSION_STRING was not found in Triton logs. \n***" + RET=1 + fi +done -grep "Python version is 3.7, NumPy version is 1.20.1, and Tensorflow version is 2.1.0" $SERVER_LOG -if [ $? -ne 0 ]; then +# Test default (non set) locale in python stub processes +# NOTE: In certain pybind versions, the locale settings may not be propagated from parent to +# stub processes correctly. See https://github.com/triton-inference-server/python_backend/pull/260. +export LC_ALL=INVALID +grep "Locale is (None, None)" $SERVER_LOG + if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Default unset Locale was not found in Triton logs. \n***" + RET=1 + fi +set -e + +rm $SERVER_LOG + +# Test locale set via environment variable in python stub processes +# NOTE: In certain pybind versions, the locale settings may not be propagated from parent to +# stub processes correctly. See https://github.com/triton-inference-server/python_backend/pull/260. +export LC_ALL=C.UTF-8 +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG - echo -e "\n***\n*** Python version is 3.7, NumPy version is 1.20.1, and Tensorflow version is 2.1.0 was not found in Triton logs. \n***" - RET=1 + exit 1 fi -grep "Python version is 3.8, NumPy version is 1.19.1, and Tensorflow version is 2.3.0" $SERVER_LOG -if [ $? -ne 0 ]; then +kill $SERVER_PID +wait $SERVER_PID + +set +e +grep "Locale is ('en_US', 'UTF-8')" $SERVER_LOG + if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Locale UTF-8 was not found in Triton logs. \n***" + RET=1 + fi +set -e + +rm $SERVER_LOG + +## Test re-extraction of environment. 
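+# The first load below extracts the environment; reloading after touching only
+# model.py should reuse the already extracted copy, while reloading after
+# touching the environment archive should extract it again, so the
+# "Extracting Python execution env" message is expected to appear exactly
+# twice in the server log.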
+SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1 --model-control-mode=explicit" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG - echo -e "\n***\n*** Python version is 3.8, NumPy version is 1.19.1, and Tensorflow version is 2.3.0 was not found in Triton logs. \n***" - RET=1 + exit 1 fi -grep "no version information available (required by /bin/bash)." $SERVER_LOG -if [ $? -eq 0 ]; then +# The environment should be extracted +curl -v -X POST localhost:8000/v2/repository/models/python_3_10/load +touch -m models/python_3_10/1/model.py +# The environment should not be re-extracted +curl -v -X POST localhost:8000/v2/repository/models/python_3_10/load +touch -m models/python_3_10/python_3_10_environment.tar.gz +# The environment should be re-extracted +curl -v -X POST localhost:8000/v2/repository/models/python_3_10/load + +kill $SERVER_PID +wait $SERVER_PID + +set +e + +PY310_ENV_EXTRACTION="Extracting Python execution env" +if [ `grep -c "${PY310_ENV_EXTRACTION}" ${SERVER_LOG}` != "2" ]; then cat $SERVER_LOG - echo -e "\n***\n*** \"no version information available (required by /bin/bash).\" was found in the server logs. \n***" + echo -e "\n***\n*** Python execution environment should be extracted exactly twice. \n***" RET=1 fi set -e @@ -156,12 +240,15 @@ aws s3 mb "${BUCKET_URL}" BUCKET_URL=${BUCKET_URL%/} BUCKET_URL_SLASH="${BUCKET_URL}/" -# Model Python 3.7 contains absolute paths and because of this it cannot be used +# Remove Python 3.7 model because it contains absolute paths and cannot be used # with S3. rm -rf models/python_3_7 -rm $SERVER_LOG +# Test with the bucket url as model repository aws s3 cp models/ "${BUCKET_URL_SLASH}" --recursive --include "*" + +rm $SERVER_LOG + SERVER_ARGS="--model-repository=$BUCKET_URL_SLASH --log-verbose=1" run_server if [ "$SERVER_PID" == "0" ]; then @@ -174,14 +261,49 @@ kill $SERVER_PID wait $SERVER_PID set +e -grep "Python version is 3.6, NumPy version is 1.18.1, and Tensorflow version is 2.1.0" $SERVER_LOG +grep "$PY36_VERSION_STRING" $SERVER_LOG if [ $? -ne 0 ]; then cat $SERVER_LOG - echo -e "\n***\n*** Python version is 3.6, NumPy version is 1.18.1, and Tensorflow version is 2.1.0 was not found in Triton logs. \n***" + echo -e "\n***\n*** $PY36_VERSION_STRING was not found in Triton logs. \n***" RET=1 fi set -e +# Clean up bucket contents +aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" + +# Test with EXECUTION_ENV_PATH outside the model directory +sed -i "s/TRITON_MODEL_DIRECTORY\/python_3_6_environment/TRITON_MODEL_DIRECTORY\/..\/python_3_6_environment/" models/python_3_6/config.pbtxt +mv models/python_3_6/python_3_6_environment.tar.gz models +sed -i "s/\$\$TRITON_MODEL_DIRECTORY\/python_3_10_environment/s3:\/\/triton-bucket-${CI_JOB_ID}\/python_3_10_environment/" models/python_3_10/config.pbtxt +mv models/python_3_10/python_3_10_environment.tar.gz models + +aws s3 cp models/ "${BUCKET_URL_SLASH}" --recursive --include "*" + +rm $SERVER_LOG + +SERVER_ARGS="--model-repository=$BUCKET_URL_SLASH --log-verbose=1" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +set +e +for EXPECTED_VERSION_STRING in "$PY36_VERSION_STRING" "$PY310_VERSION_STRING"; do + grep "$EXPECTED_VERSION_STRING" $SERVER_LOG + if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** $EXPECTED_VERSION_STRING was not found in Triton logs. 
\n***" + RET=1 + fi +done +set -e + # Clean up bucket contents and delete bucket aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" aws s3 rb "${BUCKET_URL}" diff --git a/qa/L0_backend_python/examples/test.sh b/qa/L0_backend_python/examples/test.sh old mode 100644 new mode 100755 index bde23b3506..4f9cddab8d --- a/qa/L0_backend_python/examples/test.sh +++ b/qa/L0_backend_python/examples/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -32,22 +32,32 @@ TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/python_backend/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./examples_server.log" RET=0 rm -fr *.log python_backend/ -# # Skip torch install on Jetson since it is already installed. +# Install torch +pip3 uninstall -y torch if [ "$TEST_JETSON" == "0" ]; then - pip3 uninstall -y torch - pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html + pip3 install torch==2.0.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html torchvision==0.15.0+cu117 +else + pip3 install torch==2.0.0 -f https://download.pytorch.org/whl/torch_stable.html torchvision==0.15.0 +fi + +# Install `validators` for Model Instance Kind example +pip3 install validators + +# Install JAX +if [ "$TEST_JETSON" == "0" ]; then + pip3 install --upgrade "jax[cuda12_local]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html fi git clone https://github.com/triton-inference-server/python_backend -b $PYTHON_BACKEND_REPO_TAG cd python_backend # Example 1 -CLIENT_LOG="./add_sub_client.log" +CLIENT_LOG="./examples_add_sub_client.log" mkdir -p models/add_sub/1/ cp examples/add_sub/model.py models/add_sub/1/model.py cp examples/add_sub/config.pbtxt models/add_sub/config.pbtxt @@ -77,7 +87,7 @@ kill $SERVER_PID wait $SERVER_PID # Example 2 -CLIENT_LOG="./pytorch_client.log" +CLIENT_LOG="./examples_pytorch_client.log" mkdir -p models/pytorch/1/ cp examples/pytorch/model.py models/pytorch/1/model.py cp examples/pytorch/config.pbtxt models/pytorch/config.pbtxt @@ -108,8 +118,43 @@ wait $SERVER_PID # Example 3 +# JAX AddSub +# JAX is not supported on Jetson +if [ "$TEST_JETSON" == "0" ]; then + CLIENT_LOG="./examples_jax_client.log" + mkdir -p models/jax/1/ + cp examples/jax/model.py models/jax/1/model.py + cp examples/jax/config.pbtxt models/jax/config.pbtxt + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 + fi + + set +e + python3 examples/jax/client.py > $CLIENT_LOG + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify jax example. \n***" + RET=1 + fi + + grep "PASS" $CLIENT_LOG + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify jax example. 
\n***" + cat $CLIENT_LOG + RET=1 + fi + set -e + + kill $SERVER_PID + wait $SERVER_PID +fi + +# Example 4 + # BLS Sync -CLIENT_LOG="./sync_client.log" +CLIENT_LOG="./examples_sync_client.log" mkdir -p models/bls_sync/1 cp examples/bls/sync_model.py models/bls_sync/1/model.py cp examples/bls/sync_config.pbtxt models/bls_sync/config.pbtxt @@ -138,10 +183,10 @@ set -e kill $SERVER_PID wait $SERVER_PID -# Example 4 +# Example 5 # Decoupled Repeat -CLIENT_LOG="./repeat_client.log" +CLIENT_LOG="./examples_repeat_client.log" mkdir -p models/repeat_int32/1/ cp examples/decoupled/repeat_model.py models/repeat_int32/1/model.py cp examples/decoupled/repeat_config.pbtxt models/repeat_int32/config.pbtxt @@ -170,10 +215,10 @@ set -e kill $SERVER_PID wait $SERVER_PID -# Example 5 +# Example 6 # Decoupled Square -CLIENT_LOG="./square_client.log" +CLIENT_LOG="./examples_square_client.log" mkdir -p models/square_int32/1/ cp examples/decoupled/square_model.py models/square_int32/1/model.py cp examples/decoupled/square_config.pbtxt models/square_int32/config.pbtxt @@ -209,7 +254,7 @@ wait $SERVER_PID # Having multiple python versions lead to build issues. # Anaconda is not officially supported on Jetson. if [ "$TEST_JETSON" == "0" ]; then - CLIENT_LOG="./async_client.log" + CLIENT_LOG="./examples_async_client.log" mkdir -p models/bls_async/1 cp examples/bls/async_model.py models/bls_async/1/model.py cp examples/bls/async_config.pbtxt models/bls_async/config.pbtxt @@ -241,17 +286,11 @@ if [ "$TEST_JETSON" == "0" ]; then fi # Auto Complete Model Configuration Example -CLIENT_LOG="./auto_complete_client.log" +CLIENT_LOG="./examples_auto_complete_client.log" mkdir -p models/nobatch_auto_complete/1/ mkdir -p models/batch_auto_complete/1/ cp examples/auto_complete/nobatch_model.py models/nobatch_auto_complete/1/model.py cp examples/auto_complete/batch_model.py models/batch_auto_complete/1/model.py -if [ "$TEST_JETSON" == "1" ]; then - echo -e 'name: "nobatch_auto_complete" \ninstance_group [{ kind: KIND_CPU }]' > \ - models/nobatch_auto_complete/config.pbtxt - echo -e 'name: "batch_auto_complete" \ninstance_group [{ kind: KIND_CPU }]' > \ - models/batch_auto_complete/config.pbtxt -fi SERVER_ARGS="$SERVER_ARGS --strict-model-config=false" @@ -280,6 +319,132 @@ set -e kill $SERVER_PID wait $SERVER_PID +# BLS Decoupled Sync +CLIENT_LOG="./examples_bls_decoupled_sync_client.log" +mkdir -p models/bls_decoupled_sync/1 +cp examples/bls_decoupled/sync_model.py models/bls_decoupled_sync/1/model.py +cp examples/bls_decoupled/sync_config.pbtxt models/bls_decoupled_sync/config.pbtxt +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 +fi + +set +e +python3 examples/bls_decoupled/sync_client.py > $CLIENT_LOG +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify BLS Decoupled Sync example. \n***" + RET=1 +fi + +grep "PASS" $CLIENT_LOG +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify BLS Decoupled Sync example. 
\n***" + cat $CLIENT_LOG + RET=1 +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# BLS Decoupled Async +if [ "$TEST_JETSON" == "0" ]; then + CLIENT_LOG="./examples_bls_decoupled_async_client.log" + mkdir -p models/bls_decoupled_async/1 + cp examples/bls_decoupled/async_model.py models/bls_decoupled_async/1/model.py + cp examples/bls_decoupled/async_config.pbtxt models/bls_decoupled_async/config.pbtxt + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 + fi + + set +e + python3 examples/bls_decoupled/async_client.py > $CLIENT_LOG + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify BLS Decoupled Async example. \n***" + RET=1 + fi + + grep "PASS" $CLIENT_LOG + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify BLS Decoupled Async example. \n***" + cat $CLIENT_LOG + RET=1 + fi + + set -e + + kill $SERVER_PID + wait $SERVER_PID +fi + +# Example 7 + +# Model Instance Kind +CLIENT_LOG="./examples_model_instance_kind.log" +mkdir -p models/resnet50/1 +cp examples/instance_kind/model.py models/resnet50/1/ +cp examples/instance_kind/config.pbtxt models/resnet50/ +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 +fi + +set +e +python3 examples/instance_kind/client.py --label_file examples/instance_kind/resnet50_labels.txt > $CLIENT_LOG +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify Model instance Kind example. \n***" + RET=1 +fi + +grep "PASS" $CLIENT_LOG +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify Model Instance Kind example. Example failed to pass. \n***" + cat $CLIENT_LOG + RET=1 +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Custom Metrics +CLIENT_LOG="./examples_custom_metrics_client.log" +mkdir -p models/custom_metrics/1 +cp examples/custom_metrics/model.py models/custom_metrics/1/model.py +cp examples/custom_metrics/config.pbtxt models/custom_metrics/config.pbtxt +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 +fi + +set +e +python3 examples/custom_metrics/client.py > $CLIENT_LOG +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify Custom Metrics example. \n***" + RET=1 +fi + +grep "PASS" $CLIENT_LOG +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify Custom Metrics example. \n***" + cat $CLIENT_LOG + RET=1 +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + + if [ $RET -eq 0 ]; then echo -e "\n***\n*** Example verification test PASSED.\n***" else diff --git a/qa/L0_backend_python/io/io_test.py b/qa/L0_backend_python/io/io_test.py old mode 100644 new mode 100755 index 8a88837478..ff67e8c0ff --- a/qa/L0_backend_python/io/io_test.py +++ b/qa/L0_backend_python/io/io_test.py @@ -1,4 +1,6 @@ -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,22 +30,21 @@ sys.path.append("../../common") +import os +import queue +import unittest from functools import partial -import test_util as tu + +import numpy as np import shm_util -import tritonclient.http as httpclient +import test_util as tu import tritonclient.grpc as grpcclient from tritonclient.utils import * -import numpy as np -import unittest -import queue -import os -TRIAL = os.getenv('TRIAL') +TRIAL = os.getenv("TRIAL") class UserData: - def __init__(self): self._completed_requests = queue.Queue() @@ -56,55 +57,102 @@ def callback(user_data, result, error): class IOTest(tu.TestResultCollector): - def setUp(self): self._shm_leak_detector = shm_util.ShmLeakDetector() + self._client = grpcclient.InferenceServerClient("localhost:8001") - def _run_test(self): + def _run_ensemble_test(self): model_name = "ensemble_io" user_data = UserData() - with grpcclient.InferenceServerClient("localhost:8001") as client: - input0 = np.random.random([1000]).astype(np.float32) - client.start_stream(callback=partial(callback, user_data)) - for model_1_in_gpu in [True, False]: - for model_2_in_gpu in [True, False]: - for model_3_in_gpu in [True, False]: - gpu_output = np.asarray( - [model_1_in_gpu, model_2_in_gpu, model_3_in_gpu], - dtype=bool) - inputs = [ - grpcclient.InferInput( - "INPUT0", input0.shape, - np_to_triton_dtype(input0.dtype)), - grpcclient.InferInput( - "GPU_OUTPUT", gpu_output.shape, - np_to_triton_dtype(gpu_output.dtype)) - ] - inputs[0].set_data_from_numpy(input0) - inputs[1].set_data_from_numpy(gpu_output) - client.async_stream_infer(model_name=model_name, - inputs=inputs) - if TRIAL == 'default': + input0 = np.random.random([1000]).astype(np.float32) + self._client.start_stream(callback=partial(callback, user_data)) + for model_1_in_gpu in [True, False]: + for model_2_in_gpu in [True, False]: + for model_3_in_gpu in [True, False]: + gpu_output = np.asarray( + [model_1_in_gpu, model_2_in_gpu, model_3_in_gpu], dtype=bool + ) + inputs = [ + grpcclient.InferInput( + "INPUT0", input0.shape, np_to_triton_dtype(input0.dtype) + ), + grpcclient.InferInput( + "GPU_OUTPUT", + gpu_output.shape, + np_to_triton_dtype(gpu_output.dtype), + ), + ] + inputs[0].set_data_from_numpy(input0) + inputs[1].set_data_from_numpy(gpu_output) + self._client.async_stream_infer( + model_name=model_name, inputs=inputs + ) + if TRIAL == "default": + result = user_data._completed_requests.get() + output0 = result.as_numpy("OUTPUT0") + self.assertIsNotNone(output0) + self.assertTrue(np.all(output0 == input0)) + else: + response_repeat = 2 + for _ in range(response_repeat): result = user_data._completed_requests.get() - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertIsNotNone(output0) self.assertTrue(np.all(output0 == input0)) - else: - response_repeat = 2 - for _ in range(response_repeat): - result = user_data._completed_requests.get() - output0 = result.as_numpy('OUTPUT0') - self.assertIsNotNone(output0) - self.assertTrue(np.all(output0 == input0)) def test_ensemble_io(self): # Only run the shared memory leak detection with the default trial - if TRIAL == 'default': + if TRIAL == "default": with self._shm_leak_detector.Probe(): - self._run_test() + self._run_ensemble_test() else: - self._run_test() - + self._run_ensemble_test() + + def test_empty_gpu_output(self): + model_name = "dlpack_empty_output" + input_data = np.array([[1.0]], 
dtype=np.float32) + inputs = [ + grpcclient.InferInput( + "INPUT", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + result = self._client.infer(model_name, inputs) + output = result.as_numpy("OUTPUT") + self.assertIsNotNone(output) + self.assertEqual(output.size, 0) + + def test_variable_gpu_output(self): + # Input is not important in this test + model_name = "variable_gpu_output" + input_data = np.array([[1.0]], dtype=np.float32) + inputs = [ + grpcclient.InferInput( + "INPUT", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + user_data = UserData() -if __name__ == '__main__': + # The test sends five requests to the model and the model returns five + # responses with different GPU output shapes + num_requests = 5 + for _ in range(num_requests): + _ = self._client.async_infer( + model_name=model_name, + inputs=inputs, + callback=partial(callback, user_data), + ) + + for i in range(num_requests): + result = user_data._completed_requests.get() + if result is InferenceServerException: + self.assertTrue(False, result) + output = result.as_numpy("OUTPUT") + self.assertIsNotNone(output) + self.assertEqual(output.size, i + 1) + np.testing.assert_almost_equal(output, np.ones(i + 1) * (i + 1)) + + +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/io/test.sh b/qa/L0_backend_python/io/test.sh old mode 100644 new mode 100755 index eb642e6e4b..86827a4260 --- a/qa/L0_backend_python/io/test.sh +++ b/qa/L0_backend_python/io/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,7 +26,7 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. UNITTEST_PY=./io_test.py -CLIENT_LOG="./client.log" +CLIENT_LOG="./io_client.log" EXPECTED_NUM_TESTS="1" TEST_RESULT_FILE='test_results.txt' source ../common.sh @@ -37,14 +37,15 @@ SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./io_server.log" RET=0 rm -fr *.log ./models pip3 uninstall -y torch -pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html +pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html +# IOTest.test_ensemble_io TRIALS="default decoupled" for trial in $TRIALS; do @@ -82,26 +83,89 @@ for trial in $TRIALS; do fi set +e - python3 $UNITTEST_PY > $CLIENT_LOG + python3 $UNITTEST_PY IOTest.test_ensemble_io > $CLIENT_LOG.test_ensemble_io if [ $? -ne 0 ]; then - echo -e "\n***\n*** io_test.py FAILED. \n***" - cat $CLIENT_LOG + echo -e "\n***\n*** IOTest.test_ensemble_io FAILED. \n***" + cat $CLIENT_LOG.test_ensemble_io RET=1 else check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS if [ $? 
-ne 0 ]; then - cat $CLIENT_LOG + cat $CLIENT_LOG.test_ensemble_io echo -e "\n***\n*** Test Result Verification Failed\n***" RET=1 fi fi - set -e kill $SERVER_PID wait $SERVER_PID done +# IOTest.test_empty_gpu_output +rm -rf models && mkdir models +mkdir -p models/dlpack_empty_output/1/ +cp ../../python_models/dlpack_empty_output/model.py ./models/dlpack_empty_output/1/ +cp ../../python_models/dlpack_empty_output/config.pbtxt ./models/dlpack_empty_output/ + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 +fi + +set +e +python3 $UNITTEST_PY IOTest.test_empty_gpu_output > $CLIENT_LOG.test_empty_gpu_output +if [ $? -ne 0 ]; then + echo -e "\n***\n*** IOTest.test_empty_gpu_output FAILED. \n***" + cat $CLIENT_LOG.test_empty_gpu_output + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG.test_empty_gpu_output + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# IOTest.test_variable_gpu_output +rm -rf models && mkdir models +mkdir -p models/variable_gpu_output/1/ +cp ../../python_models/variable_gpu_output/model.py ./models/variable_gpu_output/1/ +cp ../../python_models/variable_gpu_output/config.pbtxt ./models/variable_gpu_output/ + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 +fi + +set +e +python3 $UNITTEST_PY IOTest.test_variable_gpu_output > $CLIENT_LOG.test_variable_gpu_output +if [ $? -ne 0 ]; then + echo -e "\n***\n*** IOTest.variable_gpu_output FAILED. \n***" + cat $CLIENT_LOG.test_variable_gpu_output + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG.test_variable_gpu_output + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + if [ $RET -eq 0 ]; then echo -e "\n***\n*** IO test PASSED.\n***" else diff --git a/qa/L0_backend_python/lifecycle/lifecycle_test.py b/qa/L0_backend_python/lifecycle/lifecycle_test.py old mode 100644 new mode 100755 index f9805d7984..82856bbd32 --- a/qa/L0_backend_python/lifecycle/lifecycle_test.py +++ b/qa/L0_backend_python/lifecycle/lifecycle_test.py @@ -1,4 +1,6 @@ -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,21 +27,23 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../../common") -import test_util as tu -import shm_util +import queue +import time +import unittest from functools import partial -import tritonclient.http as httpclient + +import numpy as np +import shm_util +import test_util as tu import tritonclient.grpc as grpcclient +import tritonclient.http as httpclient from tritonclient.utils import * -import numpy as np -import unittest -import queue class UserData: - def __init__(self): self._completed_requests = queue.Queue() @@ -52,17 +56,87 @@ def callback(user_data, result, error): class LifecycleTest(tu.TestResultCollector): - def setUp(self): self._shm_leak_detector = shm_util.ShmLeakDetector() + def test_error_code(self): + model_name = "error_code" + shape = [1, 1] + # [(Triton error, expected gRPC error message starting), ...] + errors = [ + ("UNKNOWN", "[StatusCode.UNKNOWN]"), + ("INTERNAL", "[StatusCode.INTERNAL]"), + ("NOT_FOUND", "[StatusCode.NOT_FOUND]"), + ("INVALID_ARG", "[StatusCode.INVALID_ARGUMENT]"), + ("UNAVAILABLE", "[StatusCode.UNAVAILABLE]"), + ("UNSUPPORTED", "[StatusCode.UNIMPLEMENTED]"), + ("ALREADY_EXISTS", "[StatusCode.ALREADY_EXISTS]"), + ("CANCELLED", "[StatusCode.CANCELLED]"), + ("(default)", "[StatusCode.INTERNAL] unrecognized"), + ] + with self._shm_leak_detector.Probe() as shm_probe: + with grpcclient.InferenceServerClient("localhost:8001") as client: + for error, expected_grpc_error_start in errors: + input_data = np.array([[error]], dtype=np.object_) + inputs = [ + grpcclient.InferInput( + "ERROR_CODE", shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + with self.assertRaises(InferenceServerException) as e: + client.infer(model_name, inputs) + # e.g. [StatusCode.UNKNOWN] error code: TRITONSERVER_ERROR_UNKNOWN + # e.g. [StatusCode.INTERNAL] unrecognized error code: (default) + self.assertEqual( + str(e.exception), + expected_grpc_error_start + " error code: " + error, + ) + + def test_execute_cancel(self): + model_name = "execute_cancel" + log_path = "lifecycle_server.log" + execute_delay = 4.0 # seconds + shape = [1, 1] + response = {"responded": False, "result": None, "error": None} + + def callback(result, error): + response["responded"] = True + response["result"] = result + response["error"] = error + + with self._shm_leak_detector.Probe() as shm_probe: + with grpcclient.InferenceServerClient("localhost:8001") as client: + input_data = np.array([[execute_delay]], dtype=np.float32) + inputs = [ + grpcclient.InferInput( + "EXECUTE_DELAY", shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + exec_future = client.async_infer(model_name, inputs, callback) + time.sleep(2) # ensure the request is executing + self.assertFalse(response["responded"]) + exec_future.cancel() + time.sleep(2) # ensure the cancellation is delivered + self.assertTrue(response["responded"]) + + self.assertEqual(response["result"], None) + self.assertIsInstance(response["error"], InferenceServerException) + self.assertEqual(response["error"].status(), "StatusCode.CANCELLED") + with open(log_path, mode="r", encoding="utf-8", errors="strict") as f: + log_text = f.read() + self.assertIn("[execute_cancel] Request not cancelled at 1.0 s", log_text) + self.assertIn("[execute_cancel] Request cancelled at ", log_text) + def test_batch_error(self): - # The execute_error model returns an error for the first request and - # sucessfully processes the second request. 
This is making sure that - # an error in a single request does not completely fail the batch. + # The execute_error model returns an error for the first and third + # request and successfully processes the second request. This is making + # sure that an error in a single request does not completely fail the + # batch. model_name = "execute_error" shape = [2, 2] - number_of_requests = 2 + number_of_requests = 3 user_data = UserData() triton_client = grpcclient.InferenceServerClient("localhost:8001") triton_client.start_stream(callback=partial(callback, user_data)) @@ -73,16 +147,16 @@ def test_batch_error(self): input_data = np.random.randn(*shape).astype(np.float32) input_datas.append(input_data) inputs = [ - grpcclient.InferInput("IN", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + grpcclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) - triton_client.async_stream_infer(model_name=model_name, - inputs=inputs) + triton_client.async_stream_infer(model_name=model_name, inputs=inputs) for i in range(number_of_requests): result = user_data._completed_requests.get() - if i == 0: + if i == 0 or i == 2: self.assertIs(type(result), InferenceServerException) continue @@ -92,7 +166,9 @@ def test_batch_error(self): self.assertTrue( np.array_equal(output_data, input_datas[i]), "error: expected output {} to match input {}".format( - output_data, input_datas[i])) + output_data, input_datas[i] + ), + ) def test_infer_pymodel_error(self): model_name = "wrong_model" @@ -102,8 +178,9 @@ def test_infer_pymodel_error(self): with httpclient.InferenceServerClient("localhost:8000") as client: input_data = (16384 * np.random.randn(*shape)).astype(np.uint32) inputs = [ - httpclient.InferInput("IN", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + httpclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) try: @@ -113,21 +190,24 @@ def test_infer_pymodel_error(self): self.assertTrue( e.message().startswith( "Failed to process the request(s) for model instance" - ), "Exception message is not correct") + ), + "Exception message is not correct", + ) else: self.assertTrue( - False, - "Wrong exception raised or did not raise an exception") + False, "Wrong exception raised or did not raise an exception" + ) def test_incorrect_execute_return(self): - model_name = 'execute_return_error' + model_name = "execute_return_error" shape = [1, 1] with self._shm_leak_detector.Probe() as shm_probe: with httpclient.InferenceServerClient("localhost:8000") as client: input_data = (5 * np.random.randn(*shape)).astype(np.float32) inputs = [ - httpclient.InferInput("INPUT", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + httpclient.InferInput( + "INPUT", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) @@ -136,10 +216,11 @@ def test_incorrect_execute_return(self): client.infer(model_name, inputs) self.assertTrue( - str(e.exception).startswith( - "Failed to process the request(s) for model instance " - "'execute_return_error_0', message: Expected a list in the " - "execute return"), "Exception message is not correct.") + "Failed to process the request(s) for model instance " + "'execute_return_error_0_0', message: Expected a list in the " + "execute return" in str(e.exception), + "Exception message is not correct.", + ) # The second inference request will return a list of None object # instead of 
Python InferenceResponse objects. @@ -147,12 +228,13 @@ def test_incorrect_execute_return(self): client.infer(model_name, inputs) self.assertTrue( - str(e.exception).startswith( - "Failed to process the request(s) for model instance " - "'execute_return_error_0', message: Expected an " - "'InferenceResponse' object in the execute function return" - " list"), "Exception message is not correct.") + "Failed to process the request(s) for model instance " + "'execute_return_error_0_0', message: Expected an " + "'InferenceResponse' object in the execute function return" + " list" in str(e.exception), + "Exception message is not correct.", + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/lifecycle/test.sh b/qa/L0_backend_python/lifecycle/test.sh old mode 100644 new mode 100755 index 9d7917b538..3d843ea874 --- a/qa/L0_backend_python/lifecycle/test.sh +++ b/qa/L0_backend_python/lifecycle/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,8 +25,8 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -CLIENT_LOG="./client.log" -EXPECTED_NUM_TESTS="3" +CLIENT_LOG="./lifecycle_client.log" +EXPECTED_NUM_TESTS="5" TEST_RESULT_FILE='test_results.txt' source ../common.sh source ../../common/util.sh @@ -35,11 +35,19 @@ TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./lifecycle_server.log" RET=0 rm -fr *.log ./models +mkdir -p models/error_code/1/ +cp ../../python_models/error_code/model.py ./models/error_code/1/ +cp ../../python_models/error_code/config.pbtxt ./models/error_code/ + +mkdir -p models/execute_cancel/1/ +cp ../../python_models/execute_cancel/model.py ./models/execute_cancel/1/ +cp ../../python_models/execute_cancel/config.pbtxt ./models/execute_cancel/ + mkdir -p models/execute_error/1/ cp ../../python_models/execute_error/model.py ./models/execute_error/1/ cp ../../python_models/execute_error/config.pbtxt ./models/execute_error/ @@ -72,7 +80,7 @@ set +e # Run this multiple times to catch any intermittent segfault. for i in {0..4}; do - python3 lifecycle_test.py > $CLIENT_LOG 2>&1 + python3 lifecycle_test.py > $CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** lifecycle_test.py FAILED. \n***" @@ -171,10 +179,6 @@ set -e rm -rf models/ mkdir -p models/auto_complete_error/1/ cp ../../python_models/auto_complete_error/model.py ./models/auto_complete_error/1/ -if [ "$TEST_JETSON" == "1" ]; then - echo -e 'name: "auto_complete_error" \ninstance_group [{ kind: KIND_CPU }]' > \ - models/auto_complete_error/config.pbtxt -fi SERVER_ARGS="${SERVER_ARGS} --strict-model-config=false" diff --git a/qa/L0_backend_python/logging/logging_test.py b/qa/L0_backend_python/logging/logging_test.py new file mode 100755 index 0000000000..b21919df65 --- /dev/null +++ b/qa/L0_backend_python/logging/logging_test.py @@ -0,0 +1,58 @@ +#!/usr/bin/env python3 + +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../../common") +import unittest + +import numpy as np +import test_util as tu +import tritonclient.http as httpclient +from tritonclient.utils import * + + +class LogTest(tu.TestResultCollector): + def test_log_output(self): + model_name = "identity_fp32_logging" + with httpclient.InferenceServerClient("localhost:8000") as client: + input_data = np.array([[1.0]], dtype=np.float32) + inputs = [ + httpclient.InferInput( + "INPUT0", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + result = client.infer(model_name, inputs) + output0 = result.as_numpy("OUTPUT0") + self.assertIsNotNone(output0) + self.assertTrue(np.all(output0 == input_data)) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_backend_python/logging/test.sh b/qa/L0_backend_python/logging/test.sh new file mode 100755 index 0000000000..b665ead7dd --- /dev/null +++ b/qa/L0_backend_python/logging/test.sh @@ -0,0 +1,231 @@ +#!/bin/bash +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +CLIENT_LOG="logging_client.log" +TEST_RESULT_FILE="test_results.txt" +LOG_TEST="logging_test.py" +SERVER_LOG="./logging_server.log" + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +export CUDA_VISIBLE_DEVICES=0 + +# On windows the paths invoked by the script (running in WSL) must use +# /mnt/c when needed but the paths on the tritonserver command-line +# must be C:/ style. +if [[ "$(< /proc/sys/kernel/osrelease)" == *microsoft* ]]; then + MODELDIR=${MODELDIR:=C:/models} + DATADIR=${DATADIR:="/mnt/c/data/inferenceserver/${REPO_VERSION}"} + BACKEND_DIR=${BACKEND_DIR:=C:/tritonserver/backends} + SERVER=${SERVER:=/mnt/c/tritonserver/bin/tritonserver.exe} + export WSLENV=$WSLENV:TRITONSERVER_DELAY_SCHEDULER +else + MODELDIR=${MODELDIR:=`pwd`} + DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"} + TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} + SERVER=${TRITON_DIR}/bin/tritonserver + BACKEND_DIR=${TRITON_DIR}/backends +fi + +MODELSDIR=`pwd`/models +source ../../common/util.sh + +function verify_log_counts () { + non_verbose_expected=$1 + verbose_expected=$2 + + if [ `grep -c "Specific Msg!" $SERVER_LOG` != $non_verbose_expected ]; then + echo -e "\n***\n*** Test Failed: Specific Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Info Msg!" $SERVER_LOG` != $non_verbose_expected ]; then + echo -e "\n***\n*** Test Failed: Info Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Warning Msg!" $SERVER_LOG` != $non_verbose_expected ]; then + echo -e "\n***\n*** Test Failed: Warning Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Error Msg!" $SERVER_LOG` != $non_verbose_expected ]; then + echo -e "\n***\n*** Test Failed: Error Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Verbose Msg!" $SERVER_LOG` != $verbose_expected ]; then + echo -e "\n***\n*** Test Failed: Verbose Msg Count Incorrect\n***" + RET=1 + fi +} + +rm -f *.log + +# set up simple repository MODELBASE +rm -fr $MODELSDIR && mkdir -p $MODELSDIR && \ + python_model="identity_fp32_logging" + mkdir -p models/$python_model/1/ + cp ../../python_models/$python_model/config.pbtxt models/$python_model/config.pbtxt + cp ../../python_models/$python_model/model.py models/$python_model/1/ +RET=0 + +#Run Server with Default Log Settings +SERVER_ARGS="--model-repository=$MODELSDIR --backend-directory=${BACKEND_DIR}" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python3 $LOG_TEST >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Test Failed\n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? 
-ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Check that the correct number of log messages are present [ non-verbose-msg-cnt | verbose-msg-cnt ] +verify_log_counts 4 0 + +rm -f *.log +#Run Server Enabling Verbose Messages +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +# Enable verbose logging +code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_verbose_level":1}' localhost:8000/v2/logging` + +if [ "$code" != "200" ]; then + cat ./curl.out + echo -e "\n***\n*** Test Failed: Could not Change Log Settings\n***" + RET=1 +fi + +python3 $LOG_TEST >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Test Failed\n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Verbose count is only 3 because the model must initialize before +# log settings can be modified +verify_log_counts 4 3 + +rm -f *.log +#Run Server and Disable Info, Warning, and Error Log Messages +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +# Disable all non-verbose logging +BOOL_PARAMS=${BOOL_PARAMS:="log_info log_warning log_error"} +for BOOL_PARAM in $BOOL_PARAMS; do + # Set each boolean log setting to false via the logging endpoint + code=`curl -s -w %{http_code} -o ./curl.out -d'{"'"$BOOL_PARAM"'":false}' localhost:8000/v2/logging` + if [ "$code" != "200" ]; then + cat ./curl.out + echo -e "\n***\n*** Test Failed: Could not Change Log Settings\n***" + RET=1 + fi +done + +python3 $LOG_TEST >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Test Failed\n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Will have 1 occurrence of each non-verbose log type +# because the server must initialize before log settings +# can be modified +verify_log_counts 1 0 + + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Logging test PASSED. \n***" +else + echo -e "\n***\n*** Logging test FAILED. \n***" +fi + +exit $RET diff --git a/qa/L0_backend_python/model_control/model_control_test.py b/qa/L0_backend_python/model_control/model_control_test.py old mode 100644 new mode 100755 index feceda01e4..17686f97d5 --- a/qa/L0_backend_python/model_control/model_control_test.py +++ b/qa/L0_backend_python/model_control/model_control_test.py @@ -1,4 +1,6 @@ -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,22 +30,22 @@ sys.path.append("../../common") +import unittest + +import numpy as np +import shm_util import test_util as tu import tritonclient.http as httpclient from tritonclient.utils import * -import numpy as np -import unittest -import shm_util class ExplicitModelTest(tu.TestResultCollector): - def setUp(self): self._shm_leak_detector = shm_util.ShmLeakDetector() def send_identity_request(self, client, model_name): inputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "FP32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "FP32")) input0_data = np.arange(start=0, stop=16, dtype=np.float32) input0_data = np.expand_dims(input0_data, axis=0) inputs[0].set_data_from_numpy(input0_data) @@ -52,13 +54,14 @@ def send_identity_request(self, client, model_name): result = client.infer( model_name=model_name, inputs=inputs, - outputs=[httpclient.InferRequestedOutput('OUTPUT0')]) - output_numpy = result.as_numpy('OUTPUT0') + outputs=[httpclient.InferRequestedOutput("OUTPUT0")], + ) + output_numpy = result.as_numpy("OUTPUT0") self.assertTrue(np.all(input0_data == output_numpy)) def test_model_reload(self): model_name = "identity_fp32" - ensemble_model_name = 'simple_' + "identity_fp32" + ensemble_model_name = "simple_" + "identity_fp32" with httpclient.InferenceServerClient("localhost:8000") as client: for _ in range(5): self.assertFalse(client.is_model_ready(model_name)) @@ -76,5 +79,5 @@ def test_model_reload(self): self.assertFalse(client.is_model_ready(ensemble_model_name)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/model_control/test.sh b/qa/L0_backend_python/model_control/test.sh old mode 100644 new mode 100755 index 63fabd8bd2..c4709ce217 --- a/qa/L0_backend_python/model_control/test.sh +++ b/qa/L0_backend_python/model_control/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,14 +25,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -CLIENT_LOG="./client.log" +CLIENT_LOG="./model_control_client.log" EXPECTED_NUM_TESTS="1" TEST_RESULT_FILE='test_results.txt' TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./model_control_server.log" RET=0 rm -fr *.log ./models @@ -77,3 +77,5 @@ if [ $RET -eq 1 ]; then else echo -e "\n***\n*** model_control_test PASSED. \n***" fi + +exit $RET diff --git a/qa/L0_backend_python/python_based_backends/python_based_backends_test.py b/qa/L0_backend_python/python_based_backends/python_based_backends_test.py new file mode 100644 index 0000000000..13fe204267 --- /dev/null +++ b/qa/L0_backend_python/python_based_backends/python_based_backends_test.py @@ -0,0 +1,144 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys +import unittest +from random import randint + +import numpy as np +import tritonclient.grpc as grpcclient +from tritonclient.utils import * + +sys.path.append("../../common") +from test_util import TestResultCollector + + +class PythonBasedBackendsTest(TestResultCollector): + def setUp(self): + self.triton_client = grpcclient.InferenceServerClient(url="localhost:8001") + self.add_sub_model_1 = "add" + self.add_sub_model_2 = "sub" + self.python_model = "add_sub" + self.pytorch_model = "add_sub_pytorch" + + self.triton_client.load_model( + self.add_sub_model_1, + config='{"backend":"add_sub","version_policy":{"latest":{"num_versions":2}}}', + ) + self.triton_client.load_model(self.add_sub_model_2) + self.triton_client.load_model(self.python_model) + self.triton_client.load_model(self.pytorch_model) + + def test_add_sub_models(self): + self.assertTrue( + self.triton_client.is_model_ready(self.add_sub_model_1, model_version="2") + ) + self._test_add_sub_model( + model_name=self.add_sub_model_1, model_version="2", single_output=True + ) + + self.assertTrue( + self.triton_client.is_model_ready(self.add_sub_model_1, model_version="1") + ) + self._test_add_sub_model( + model_name=self.add_sub_model_1, model_version="1", single_output=True + ) + + self.assertTrue(self.triton_client.is_model_ready(self.add_sub_model_2)) + self._test_add_sub_model(model_name=self.add_sub_model_2, single_output=True) + + def test_python_model(self): + self.assertTrue( + self.triton_client.is_model_ready(self.python_model, model_version="2") + ) + self._test_add_sub_model( + model_name=self.python_model, shape=[16], model_version="2" + ) + + def test_pytorch_model(self): + self.assertTrue( + self.triton_client.is_model_ready(self.pytorch_model, model_version="1") + ) + self._test_add_sub_model(model_name=self.pytorch_model) + + def _test_add_sub_model( + self, model_name, model_version="1", shape=[4], single_output=False + ): + input0_data = np.random.rand(*shape).astype(np.float32) + input1_data = np.random.rand(*shape).astype(np.float32) + + inputs = [ + 
grpcclient.InferInput( + "INPUT0", input0_data.shape, np_to_triton_dtype(input0_data.dtype) + ), + grpcclient.InferInput( + "INPUT1", input1_data.shape, np_to_triton_dtype(input1_data.dtype) + ), + ] + + inputs[0].set_data_from_numpy(input0_data) + inputs[1].set_data_from_numpy(input1_data) + + if single_output: + outputs = [grpcclient.InferRequestedOutput("OUTPUT")] + + else: + outputs = [ + grpcclient.InferRequestedOutput("OUTPUT0"), + grpcclient.InferRequestedOutput("OUTPUT1"), + ] + + response = self.triton_client.infer( + model_name=model_name, + inputs=inputs, + model_version=model_version, + request_id=str(randint(10, 99)), + outputs=outputs, + ) + + if single_output: + if model_name == "add": + self.assertTrue( + np.allclose(input0_data + input1_data, response.as_numpy("OUTPUT")) + ) + else: + self.assertTrue( + np.allclose(input0_data - input1_data, response.as_numpy("OUTPUT")) + ) + else: + self.assertTrue( + np.allclose(input0_data + input1_data, response.as_numpy("OUTPUT0")) + ) + self.assertTrue( + np.allclose(input0_data - input1_data, response.as_numpy("OUTPUT1")) + ) + + def tearDown(self): + self.triton_client.close() + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_backend_python/python_based_backends/test.sh b/qa/L0_backend_python/python_based_backends/test.sh new file mode 100755 index 0000000000..0f332eb3e0 --- /dev/null +++ b/qa/L0_backend_python/python_based_backends/test.sh @@ -0,0 +1,113 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +source ../../common/util.sh + +TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} +SERVER=${TRITON_DIR}/bin/tritonserver +BACKEND_DIR=${TRITON_DIR}/backends +QA_MODELS_PATH="../../python_models" +MODEL_REPOSITORY="$(pwd)/models" +SERVER_ARGS="--model-repository=${MODEL_REPOSITORY} --backend-directory=${BACKEND_DIR} --model-control-mode=explicit --log-verbose=1" +SERVER_LOG="./python_based_backends_server.log" +CLIENT_LOG="./python_based_backends_client.log" +TEST_RESULT_FILE="./test_results.txt" +CLIENT_PY="./python_based_backends_test.py" +GEN_PYTORCH_MODEL_PY="../../common/gen_qa_pytorch_model.py" +EXPECTED_NUM_TESTS=3 +RET=0 + +rm -rf ${MODEL_REPOSITORY} +pip3 install torch + +# Setup add_sub backend and models +mkdir -p ${BACKEND_DIR}/add_sub +cp ${QA_MODELS_PATH}/python_based_backends/add_sub_backend/model.py ${BACKEND_DIR}/add_sub/model.py + +mkdir -p ${MODEL_REPOSITORY}/add/1/ +echo '{ "operation": "add" }' > ${MODEL_REPOSITORY}/add/1/model.json +echo "backend: \"add_sub\"" > ${MODEL_REPOSITORY}/add/config.pbtxt +cp -r ${MODEL_REPOSITORY}/add/1/ ${MODEL_REPOSITORY}/add/2/ + +mkdir -p ${MODEL_REPOSITORY}/sub/1/ +echo '{ "operation": "sub" }' > ${MODEL_REPOSITORY}/sub/1/model.json +echo "backend: \"add_sub\"" > ${MODEL_REPOSITORY}/sub/config.pbtxt + +# Setup python backend model +mkdir -p ${MODEL_REPOSITORY}/add_sub/1 +cp ${QA_MODELS_PATH}/add_sub/model.py ${MODEL_REPOSITORY}/add_sub/1/ +cp ${QA_MODELS_PATH}/add_sub/config.pbtxt ${MODEL_REPOSITORY}/add_sub/ +cp -r ${MODEL_REPOSITORY}/add_sub/1/ ${MODEL_REPOSITORY}/add_sub/2/ + +# Setup pytorch backend model +cp ${GEN_PYTORCH_MODEL_PY} ./gen_qa_pytorch_model.py +GEN_PYTORCH_MODEL_PY=./gen_qa_pytorch_model.py + +set +e +python3 ${GEN_PYTORCH_MODEL_PY} -m ${MODEL_REPOSITORY} + +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Running ${GEN_PYTORCH_MODEL_PY} FAILED. \n***" + exit 1 +fi +set -e + +run_server +if [ "$SERVER_PID" == "0" ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Failed to start $SERVER\n***" + exit 1 +fi + +set +e +python3 $CLIENT_PY -v >$CLIENT_LOG 2>&1 + +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Running $CLIENT_PY FAILED. \n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Test Result Verification FAILED.\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID +rm -rf ${MODEL_REPOSITORY} ${GEN_PYTORCH_MODEL_PY} + +if [ $RET -eq 1 ]; then + cat $CLIENT_LOG + cat $SERVER_LOG + echo -e "\n***\n*** Python-based Backends test FAILED. \n***" +else + echo -e "\n***\n*** Python-based Backends test PASSED. \n***" +fi + +exit $RET diff --git a/qa/L0_backend_python/python_test.py b/qa/L0_backend_python/python_test.py old mode 100644 new mode 100755 index 3c5d520775..eb4d02aa53 --- a/qa/L0_backend_python/python_test.py +++ b/qa/L0_backend_python/python_test.py @@ -1,6 +1,6 @@ #!/usr/bin/python -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -30,21 +30,20 @@ sys.path.append("../common") +import os import unittest + import numpy as np -import test_util as tu -import shm_util import requests as httpreq -import os - -from tritonclient.utils import * +import shm_util +import test_util as tu import tritonclient.http as httpclient +from tritonclient.utils import * -TEST_JETSON = bool(int(os.environ.get('TEST_JETSON', 0))) +TEST_JETSON = bool(int(os.environ.get("TEST_JETSON", 0))) class PythonTest(tu.TestResultCollector): - def setUp(self): self._shm_leak_detector = shm_util.ShmLeakDetector() @@ -52,33 +51,39 @@ def _infer_help(self, model_name, shape, data_type): with httpclient.InferenceServerClient("localhost:8000") as client: input_data_0 = np.array(np.random.randn(*shape), dtype=data_type) inputs = [ - httpclient.InferInput("INPUT0", shape, - np_to_triton_dtype(input_data_0.dtype)) + httpclient.InferInput( + "INPUT0", shape, np_to_triton_dtype(input_data_0.dtype) + ) ] inputs[0].set_data_from_numpy(input_data_0) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertTrue(np.all(input_data_0 == output0)) + def _create_cuda_region(self, client, size, name): + import tritonclient.utils.cuda_shared_memory as cuda_shared_memory + + shm0_handle = cuda_shared_memory.create_shared_memory_region( + name, byte_size=size, device_id=0 + ) + client.register_cuda_shared_memory( + name, cuda_shared_memory.get_raw_handle(shm0_handle), 0, size + ) + return shm0_handle + def _optional_input_infer(self, model_name, has_input0, has_input1): with httpclient.InferenceServerClient("localhost:8000") as client: shape = (1,) if has_input0: - input0_numpy = np.random.randint(0, - 100, - size=shape, - dtype=np.int32) + input0_numpy = np.random.randint(0, 100, size=shape, dtype=np.int32) else: # Set the input0 to a default value if it is optional. This is # the input used by the model if it is not provided. input0_numpy = np.array([5], dtype=np.int32) if has_input1: - input1_numpy = np.random.randint(0, - 100, - size=shape, - dtype=np.int32) + input1_numpy = np.random.randint(0, 100, size=shape, dtype=np.int32) else: # Set the input1 to a default value if it is optional. This is # the input used by the model if it is not provided. 
@@ -88,68 +93,136 @@ def _optional_input_infer(self, model_name, has_input0, has_input1): if has_input0: inputs.append( httpclient.InferInput( - "INPUT0", shape, - np_to_triton_dtype(input0_numpy.dtype))) + "INPUT0", shape, np_to_triton_dtype(input0_numpy.dtype) + ) + ) inputs[-1].set_data_from_numpy(input0_numpy) if has_input1: inputs.append( httpclient.InferInput( - "INPUT1", shape, - np_to_triton_dtype(input1_numpy.dtype))) + "INPUT1", shape, np_to_triton_dtype(input1_numpy.dtype) + ) + ) inputs[-1].set_data_from_numpy(input1_numpy) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertIsNotNone(output0, "OUTPUT0 was not found.") - output1 = result.as_numpy('OUTPUT1') + output1 = result.as_numpy("OUTPUT1") self.assertIsNotNone(output1, "OUTPUT1 was not found.") expected_output0 = input0_numpy + input1_numpy expected_output1 = input0_numpy - input1_numpy - np.testing.assert_equal(output0, expected_output0, - "OUTPUT0 doesn't match expected OUTPUT0") - np.testing.assert_equal(output1, expected_output1, - "OUTPUT1 doesn't match expected OUTPUT1") - - # We do not use a docker on Jetson so it does not impose a shared memory - # allocation limit of 1GB. This means test will pass without the expected - # error on jetson and is hence unnecessary. + np.testing.assert_equal( + output0, expected_output0, "OUTPUT0 doesn't match expected OUTPUT0" + ) + np.testing.assert_equal( + output1, expected_output1, "OUTPUT1 doesn't match expected OUTPUT1" + ) + + def test_growth_error(self): + # 2 MiBs + total_byte_size = 2 * 1024 * 1024 + shape = [total_byte_size] + model_name = "identity_uint8_nobatch" + dtype = np.uint8 + with self._shm_leak_detector.Probe() as shm_probe: + self._infer_help(model_name, shape, dtype) + + # 1 GiB payload leads to error in the main Python backend process. + # Total shared memory available is 1GiB. + total_byte_size = 1024 * 1024 * 1024 + shape = [total_byte_size] + with self.assertRaises(InferenceServerException) as ex: + self._infer_help(model_name, shape, dtype) + self.assertIn( + "Failed to increase the shared memory pool size", str(ex.exception) + ) + + # 512 MiBs payload leads to error in the Python stub process. + total_byte_size = 512 * 1024 * 1024 + shape = [total_byte_size] + with self.assertRaises(InferenceServerException) as ex: + self._infer_help(model_name, shape, dtype) + self.assertIn( + "Failed to increase the shared memory pool size", str(ex.exception) + ) + + # 2 MiBs + # Send a small payload to make sure it is still working properly + total_byte_size = 2 * 1024 * 1024 + shape = [total_byte_size] + with self._shm_leak_detector.Probe() as shm_probe: + self._infer_help(model_name, shape, dtype) + + # GPU tensors are not supported on jetson + # CUDA Shared memory is not supported on jetson if not TEST_JETSON: - def test_growth_error(self): - # 2 MiBs - total_byte_size = 2 * 1024 * 1024 - shape = [total_byte_size] - model_name = 'identity_uint8_nobatch' - dtype = np.uint8 - with self._shm_leak_detector.Probe() as shm_probe: - self._infer_help(model_name, shape, dtype) - - # 1 GiB payload leads to error in the main Python backned process. - # Total shared memory available is 1GiB.
- total_byte_size = 1024 * 1024 * 1024 - shape = [total_byte_size] - with self.assertRaises(InferenceServerException) as ex: - self._infer_help(model_name, shape, dtype) - self.assertIn("Failed to increase the shared memory pool size", - str(ex.exception)) - - # 512 MiBs payload leads to error in the Python stub process. - total_byte_size = 512 * 1024 * 1024 - shape = [total_byte_size] - with self.assertRaises(InferenceServerException) as ex: - self._infer_help(model_name, shape, dtype) - self.assertIn("Failed to increase the shared memory pool size", - str(ex.exception)) - - # 2 MiBs - # Send a small paylaod to make sure it is still working properly - total_byte_size = 2 * 1024 * 1024 - shape = [total_byte_size] - with self._shm_leak_detector.Probe() as shm_probe: - self._infer_help(model_name, shape, dtype) + def test_gpu_tensor_error(self): + import tritonclient.utils.cuda_shared_memory as cuda_shared_memory + + model_name = "identity_bool" + with httpclient.InferenceServerClient("localhost:8000") as client: + input_data = np.array([[True] * 1000], dtype=bool) + inputs = [ + httpclient.InferInput( + "INPUT0", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + + requested_outputs = [httpclient.InferRequestedOutput("OUTPUT0")] + + # intentionally create a shared memory region with not enough size. + client.unregister_cuda_shared_memory() + shm0_handle = self._create_cuda_region(client, 1, "output0_data") + + requested_outputs[0].set_shared_memory("output0_data", 1) + with self.assertRaises(InferenceServerException) as ex: + client.infer(model_name, inputs, outputs=requested_outputs) + self.assertIn( + "should be at least 1000 bytes to hold the results", + str(ex.exception), + ) + client.unregister_cuda_shared_memory() + cuda_shared_memory.destroy_shared_memory_region(shm0_handle) + + def test_dlpack_tensor_error(self): + import tritonclient.utils.cuda_shared_memory as cuda_shared_memory + + model_name = "dlpack_identity" + with httpclient.InferenceServerClient("localhost:8000") as client: + input_data = np.array([[1] * 1000], dtype=np.float32) + inputs = [ + httpclient.InferInput( + "INPUT0", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + + requested_outputs = [httpclient.InferRequestedOutput("OUTPUT0")] + input_data_size = input_data.itemsize * input_data.size + client.unregister_cuda_shared_memory() + input_region = self._create_cuda_region( + client, input_data_size, "input0_data" + ) + inputs[0].set_shared_memory("input0_data", input_data_size) + cuda_shared_memory.set_shared_memory_region(input_region, [input_data]) + + # Intentionally create a small region to trigger an error + shm0_handle = self._create_cuda_region(client, 1, "output0_data") + requested_outputs[0].set_shared_memory("output0_data", 1) + + with self.assertRaises(InferenceServerException) as ex: + client.infer(model_name, inputs, outputs=requested_outputs) + self.assertIn( + "should be at least 4000 bytes to hold the results", + str(ex.exception), + ) + client.unregister_cuda_shared_memory() + cuda_shared_memory.destroy_shared_memory_region(shm0_handle) def test_async_infer(self): model_name = "identity_uint8" @@ -158,18 +231,19 @@ def test_async_infer(self): with self._shm_leak_detector.Probe() as shm_probe: with httpclient.InferenceServerClient( - "localhost:8000", - concurrency=request_parallelism) as client: + "localhost:8000", concurrency=request_parallelism + ) as client: input_datas = [] requests = [] for i in 
range(request_parallelism): - input_data = (16384 * np.random.randn(*shape)).astype( - np.uint8) + input_data = (16384 * np.random.randn(*shape)).astype(np.uint8) input_datas.append(input_data) inputs = [ httpclient.InferInput( - "INPUT0", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + "INPUT0", + input_data.shape, + np_to_triton_dtype(input_data.dtype), + ) ] inputs[0].set_data_from_numpy(input_data) requests.append(client.async_infer(model_name, inputs)) @@ -180,76 +254,92 @@ def test_async_infer(self): results = requests[i].get_result() output_data = results.as_numpy("OUTPUT0") - self.assertIsNotNone(output_data, - "error: expected 'OUTPUT0'") + self.assertIsNotNone(output_data, "error: expected 'OUTPUT0'") self.assertTrue( np.array_equal(output_data, input_datas[i]), "error: expected output {} to match input {}".format( - output_data, input_datas[i])) + output_data, input_datas[i] + ), + ) # Make sure the requests ran in parallel. stats = client.get_inference_statistics(model_name) - test_cond = (len(stats['model_stats']) != 1) or ( - stats['model_stats'][0]['name'] != model_name) + test_cond = (len(stats["model_stats"]) != 1) or ( + stats["model_stats"][0]["name"] != model_name + ) self.assertFalse( - test_cond, - "error: expected statistics for {}".format(model_name)) - - stat = stats['model_stats'][0] - self.assertFalse((stat['inference_count'] != 8) or ( - stat['execution_count'] != 1 - ), "error: expected execution_count == 1 and inference_count == 8, got {} and {}" - .format(stat['execution_count'], - stat['inference_count'])) - batch_stat = stat['batch_stats'][0] + test_cond, "error: expected statistics for {}".format(model_name) + ) + + stat = stats["model_stats"][0] self.assertFalse( - batch_stat['batch_size'] != 8, - f"error: expected batch_size == 8, got {batch_stat['batch_size']}" + (stat["inference_count"] != 8) or (stat["execution_count"] != 1), + "error: expected execution_count == 1 and inference_count == 8, got {} and {}".format( + stat["execution_count"], stat["inference_count"] + ), + ) + batch_stat = stat["batch_stats"][0] + self.assertFalse( + batch_stat["batch_size"] != 8, + f"error: expected batch_size == 8, got {batch_stat['batch_size']}", ) # Check metrics to make sure they are reported correctly - metrics = httpreq.get('http://localhost:8002/metrics') + metrics = httpreq.get("http://localhost:8002/metrics") print(metrics.text) - success_str = 'nv_inference_request_success{model="identity_uint8",version="1"}' - infer_count_str = 'nv_inference_count{model="identity_uint8",version="1"}' - infer_exec_str = 'nv_inference_exec_count{model="identity_uint8",version="1"}' + success_str = ( + 'nv_inference_request_success{model="identity_uint8",version="1"}' + ) + infer_count_str = ( + 'nv_inference_count{model="identity_uint8",version="1"}' + ) + infer_exec_str = ( + 'nv_inference_exec_count{model="identity_uint8",version="1"}' + ) success_val = None infer_count_val = None infer_exec_val = None for line in metrics.text.splitlines(): if line.startswith(success_str): - success_val = float(line[len(success_str):]) + success_val = float(line[len(success_str) :]) if line.startswith(infer_count_str): - infer_count_val = float(line[len(infer_count_str):]) + infer_count_val = float(line[len(infer_count_str) :]) if line.startswith(infer_exec_str): - infer_exec_val = float(line[len(infer_exec_str):]) + infer_exec_val = float(line[len(infer_exec_str) :]) self.assertFalse( success_val != 4, "error: expected metric {} == 4, got {}".format( - success_str, success_val)) 
+ success_str, success_val + ), + ) self.assertFalse( infer_count_val != 8, "error: expected metric {} == 8, got {}".format( - infer_count_str, infer_count_val)) + infer_count_str, infer_count_val + ), + ) self.assertFalse( infer_exec_val != 1, "error: expected metric {} == 1, got {}".format( - infer_exec_str, infer_exec_val)) + infer_exec_str, infer_exec_val + ), + ) def test_bool(self): - model_name = 'identity_bool' + model_name = "identity_bool" with self._shm_leak_detector.Probe() as shm_probe: with httpclient.InferenceServerClient("localhost:8000") as client: input_data = np.array([[True, False, True]], dtype=bool) inputs = [ - httpclient.InferInput("INPUT0", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + httpclient.InferInput( + "INPUT0", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertIsNotNone(output0) self.assertTrue(np.all(output0 == input_data)) @@ -260,21 +350,32 @@ def test_infer_pytorch(self): with httpclient.InferenceServerClient("localhost:8000") as client: input_data = np.zeros(shape, dtype=np.float32) inputs = [ - httpclient.InferInput("IN", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + httpclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) result = client.infer(model_name, inputs) - output_data = result.as_numpy('OUT') + output_data = result.as_numpy("OUT") self.assertIsNotNone(output_data, "error: expected 'OUT'") - # expected inference resposne from a zero tensor + # expected inference response from a zero tensor expected_result = [ - -2.2377274, -2.3976364, -2.2464046, -2.2790744, -2.3828976, - -2.2940576, -2.2928185, -2.340665, -2.275219, -2.292135 + -2.2377274, + -2.3976364, + -2.2464046, + -2.2790744, + -2.3828976, + -2.2940576, + -2.2928185, + -2.340665, + -2.275219, + -2.292135, ] - self.assertTrue(np.allclose(output_data[0], expected_result), - 'Inference result is not correct') + self.assertTrue( + np.allclose(output_data[0], expected_result), + "Inference result is not correct", + ) def test_init_args(self): model_name = "init_args" @@ -283,35 +384,39 @@ def test_init_args(self): with httpclient.InferenceServerClient("localhost:8000") as client: input_data = np.zeros(shape, dtype=np.float32) inputs = [ - httpclient.InferInput("IN", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + httpclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) result = client.infer(model_name, inputs) - # output respone in this model is the number of keys in the args + # output response in this model is the number of keys in the args self.assertTrue( result.as_numpy("OUT") == 7, - "Number of keys in the init args is not correct") + "Number of keys in the init args is not correct", + ) def test_unicode(self): model_name = "string" shape = [1] - for i in range(3): + # The first run will use np.bytes_ and the second run will use + # np.object_ + for i in range(2): with self._shm_leak_detector.Probe() as shm_probe: - with httpclient.InferenceServerClient( - "localhost:8000") as client: - utf8 = '😀' - input_data = np.array([bytes(utf8, encoding='utf-8')], - dtype=np.bytes_) + with httpclient.InferenceServerClient("localhost:8000") as client: + utf8 = "😀" + input_data = np.array( + [bytes(utf8, encoding="utf-8")], 
dtype=np.bytes_ + ) inputs = [ httpclient.InferInput( - "INPUT0", shape, - np_to_triton_dtype(input_data.dtype)) + "INPUT0", shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertIsNotNone(output0) self.assertEqual(output0[0], input_data) @@ -321,36 +426,36 @@ def test_optional_input(self): with self._shm_leak_detector.Probe() as shm_probe: for has_input0 in [True, False]: for has_input1 in [True, False]: - self._optional_input_infer(model_name, has_input0, - has_input1) + self._optional_input_infer(model_name, has_input0, has_input1) def test_string(self): model_name = "string_fixed" shape = [1] - for i in range(6): + # Test different string outputs. This test will send 4 requests to the + # backend. The model will return 4 responses (np.object_ and np.bytes) * + # (empty output and fixed output) + for i in range(4): with self._shm_leak_detector.Probe() as shm_probe: - with httpclient.InferenceServerClient( - "localhost:8000") as client: - input_data = np.array(['123456'], dtype=np.object_) + with httpclient.InferenceServerClient("localhost:8000") as client: + input_data = np.array(["123456"], dtype=np.object_) inputs = [ httpclient.InferInput( - "INPUT0", shape, - np_to_triton_dtype(input_data.dtype)) + "INPUT0", shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertIsNotNone(output0) if i % 2 == 0: - self.assertEqual(output0[0], - input_data.astype(np.bytes_)) + self.assertEqual(output0[0], input_data.astype(np.bytes_)) else: self.assertEqual(output0.size, 0) def test_non_contiguous(self): - model_name = 'non_contiguous' + model_name = "non_contiguous" shape = [2, 10, 11, 6, 5] new_shape = [10, 2, 6, 5, 11] shape_reorder = [1, 0, 4, 2, 3] @@ -358,8 +463,9 @@ def test_non_contiguous(self): input_numpy = np.random.rand(*shape) input_numpy = input_numpy.astype(np.float32) inputs = [ - httpclient.InferInput("INPUT0", shape, - np_to_triton_dtype(input_numpy.dtype)) + httpclient.InferInput( + "INPUT0", shape, np_to_triton_dtype(input_numpy.dtype) + ) ] inputs[0].set_data_from_numpy(input_numpy) result = client.infer(model_name, inputs) @@ -369,10 +475,10 @@ def test_non_contiguous(self): output1 = input_numpy.T output2 = np.transpose(input_numpy, shape_reorder) - self.assertTrue(np.all(output0 == result.as_numpy('OUTPUT0'))) - self.assertTrue(np.all(output1 == result.as_numpy('OUTPUT1'))) - self.assertTrue(np.all(output2 == result.as_numpy('OUTPUT2'))) + self.assertTrue(np.all(output0 == result.as_numpy("OUTPUT0"))) + self.assertTrue(np.all(output1 == result.as_numpy("OUTPUT1"))) + self.assertTrue(np.all(output2 == result.as_numpy("OUTPUT2"))) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/python_unittest.py b/qa/L0_backend_python/python_unittest.py old mode 100644 new mode 100755 index c29e2d80dd..c956412f9d --- a/qa/L0_backend_python/python_unittest.py +++ b/qa/L0_backend_python/python_unittest.py @@ -1,4 +1,6 @@ -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,46 +30,59 @@ sys.path.append("../../common") -import test_util as tu -import shm_util +import os import unittest + +import shm_util +import test_util as tu import tritonclient.grpc as grpcclient from tritonclient.utils import * -import os class PythonUnittest(tu.TestResultCollector): - def setUp(self): self._shm_leak_detector = shm_util.ShmLeakDetector() def _run_unittest(self, model_name): with grpcclient.InferenceServerClient("localhost:8001") as client: # No input is required - result = client.infer(model_name, [], client_timeout=120) - output0 = result.as_numpy('OUTPUT0') + result = client.infer(model_name, [], client_timeout=240) + output0 = result.as_numpy("OUTPUT0") - # The model returns 1 if the tests were sucessfully passed. + # The model returns 1 if the tests were successfully passed. # Otherwise, it will return 0. self.assertEqual(output0, [1]) def test_python_unittest(self): - model_name = os.environ['MODEL_NAME'] - - if model_name == 'bls' or model_name == 'bls_memory' or model_name == 'bls_memory_async': - # For these tests, the memory region size will be grown. Because of - # this we need to use the shared memory probe only on the later - # call so that the probe can detect the leak correctly. - self._run_unittest(model_name) + model_name = os.environ["MODEL_NAME"] + bls_kind = os.environ.get("BLS_KIND", "non_decoupled") - # [FIXME] See DLIS-3684 + if bls_kind == "decoupled": + # Skip the shared memory probe for decoupled models for now as + # there are some small changes in the shared memory usage when + # running decoupled inferences. Confirmed that the memory growth + # is bounded. self._run_unittest(model_name) - with self._shm_leak_detector.Probe() as shm_probe: - self._run_unittest(model_name) else: - with self._shm_leak_detector.Probe() as shm_probe: + if ( + model_name == "bls" + or model_name == "bls_memory" + or model_name == "bls_memory_async" + or model_name == "bls_request_rescheduling" + ): + # For these tests, the memory region size will be grown. Because of + # this we need to use the shared memory probe only on the later + # call so that the probe can detect the leak correctly. + self._run_unittest(model_name) + + # [FIXME] See DLIS-3684 self._run_unittest(model_name) + with self._shm_leak_detector.Probe() as shm_probe: + self._run_unittest(model_name) + else: + with self._shm_leak_detector.Probe() as shm_probe: + self._run_unittest(model_name) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py b/qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py new file mode 100755 index 0000000000..06b5cd7fad --- /dev/null +++ b/qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py @@ -0,0 +1,111 @@ +#!/usr/bin/env python +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../../common") + +# GRPC streaming helpers.. +import queue +import unittest +from functools import partial + +import numpy as np +import test_util as tu +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException + + +class UserData: + def __init__(self): + self._completed_requests = queue.Queue() + + +def callback(user_data, result, error): + if error: + user_data._completed_requests.put(error) + else: + user_data._completed_requests.put(result) + + +class GrpcEndpointTest(tu.TestResultCollector): + def test_grpc_decoupled(self, sequence_id=0, sequence_start=False): + user_data = UserData() + with grpcclient.InferenceServerClient("localhost:8001") as triton_client: + # Reload the model to reset the flag + triton_client.unload_model("iterative_sequence") + triton_client.load_model("iterative_sequence") + + triton_client.start_stream(callback=partial(callback, user_data)) + inputs = [] + inputs.append(grpcclient.InferInput("IN", [1], "INT32")) + inputs[0].set_data_from_numpy(np.array([3], dtype=np.int32)) + + triton_client.async_stream_infer( + model_name="iterative_sequence", + inputs=inputs, + sequence_id=sequence_id, + sequence_start=sequence_start, + ) + res_count = 3 + while res_count > 0: + data_item = user_data._completed_requests.get() + res_count -= 1 + if type(data_item) == InferenceServerException: + raise data_item + else: + self.assertEqual(res_count, data_item.as_numpy("OUT")[0]) + self.assertEqual(0, res_count) + + def test_grpc_non_decoupled(self, sequence_id=0, sequence_start=False): + with grpcclient.InferenceServerClient("localhost:8001") as triton_client: + # Reload the model to reset the flag + triton_client.unload_model("request_rescheduling_addsub") + triton_client.load_model("request_rescheduling_addsub") + + inputs = [] + inputs.append(grpcclient.InferInput("INPUT0", [16], "FP32")) + inputs.append(grpcclient.InferInput("INPUT1", [16], "FP32")) + input0_val = np.random.randn(*[16]).astype(np.float32) + input1_val = np.random.randn(*[16]).astype(np.float32) + inputs[0].set_data_from_numpy(input0_val) + inputs[1].set_data_from_numpy(input1_val) + + results = triton_client.infer( + model_name="request_rescheduling_addsub", + inputs=inputs, + ) + + output0_data = results.as_numpy("OUTPUT0") + output1_data = results.as_numpy("OUTPUT1") + + self.assertTrue(np.array_equal(output0_data, input0_val + input1_val)) + self.assertTrue(np.array_equal(output1_data, input0_val - input1_val)) + + +if __name__ == "__main__": + unittest.main() diff --git 
a/qa/L0_backend_python/request_rescheduling/test.sh b/qa/L0_backend_python/request_rescheduling/test.sh new file mode 100755 index 0000000000..8dc43dc83f --- /dev/null +++ b/qa/L0_backend_python/request_rescheduling/test.sh @@ -0,0 +1,116 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +CLIENT_PY=../python_unittest.py +CLIENT_LOG="./request_rescheduling_client.log" +EXPECTED_NUM_TESTS="1" +TEST_RESULT_FILE='test_results.txt' +source ../../common/util.sh + +TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} +SERVER=${TRITON_DIR}/bin/tritonserver +BACKEND_DIR=${TRITON_DIR}/backends + +RET=0 + +rm -fr *.log ./models *.txt + +mkdir -p models/bls_request_rescheduling/1/ +cp ../../python_models/bls_request_rescheduling/model.py models/bls_request_rescheduling/1/ +cp ../../python_models/bls_request_rescheduling/config.pbtxt models/bls_request_rescheduling + +mkdir -p models/request_rescheduling_addsub/1/ +cp ../../python_models/request_rescheduling_addsub/model.py models/request_rescheduling_addsub/1/ +cp ../../python_models/request_rescheduling_addsub/config.pbtxt models/request_rescheduling_addsub + +mkdir -p models/iterative_sequence/1/ +cp ../../python_models/iterative_sequence/model.py models/iterative_sequence/1/ +cp ../../python_models/iterative_sequence/config.pbtxt models/iterative_sequence + +mkdir -p models/wrong_return_type/1/ +cp ../../python_models/wrong_return_type/model.py models/wrong_return_type/1/ +cp ../../python_models/wrong_return_type/config.pbtxt models/wrong_return_type + +SERVER_LOG="./request_rescheduling_server.log" +SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --model-control-mode=explicit --load-model=* --log-verbose=1" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +export MODEL_NAME='bls_request_rescheduling' + +set +e +python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** bls_request_rescheduling test FAILED. 
\n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +GRPC_TEST_PY=./grpc_endpoint_test.py +EXPECTED_NUM_TESTS="2" + +set +e +python3 $GRPC_TEST_PY >> $CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** GRPC Endpoint test FAILED. \n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + + +if [ $RET -eq 1 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Request Rescheduling test FAILED. \n***" +else + echo -e "\n***\n*** Request Rescheduling test PASSED. \n***" +fi + +exit $RET diff --git a/qa/L0_backend_python/restart/models/restart/1/model.py b/qa/L0_backend_python/restart/models/restart/1/model.py index 72bce2933a..1f7491498e 100644 --- a/qa/L0_backend_python/restart/models/restart/1/model.py +++ b/qa/L0_backend_python/restart/models/restart/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,29 +24,30 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils -import c_python_backend_utils as c_utils from os import path +import c_python_backend_utils as c_utils +import triton_python_backend_utils as pb_utils + class TritonPythonModel: - def execute(self, requests): # This function will be called once to record the free memory. Then, # the stub process will be killed to trigger Python backend restart. # After that this value will be read again to make sure that it matches # before restart. - file_name = 'free_memory.txt' + file_name = "free_memory.txt" current_free_memory = str(c_utils.shared_memory.free_memory()) if path.exists(file_name): - with open(file_name, 'r') as f: + with open(file_name, "r") as f: expected_free_memory = f.read() - assert expected_free_memory == current_free_memory, \ - (f'Free shared memory before and after restart are not equal. ' - '{expected_free_memory} (before) != {current_free_memory} (after).') + assert expected_free_memory == current_free_memory, ( + f"Free shared memory before and after restart are not equal. " + "{expected_free_memory} (before) != {current_free_memory} (after)." + ) else: - with open(file_name, 'w') as f: + with open(file_name, "w") as f: f.write(current_free_memory) responses = [] diff --git a/qa/L0_backend_python/restart/restart_test.py b/qa/L0_backend_python/restart/restart_test.py old mode 100644 new mode 100755 index 534642c2e1..4f4bf63082 --- a/qa/L0_backend_python/restart/restart_test.py +++ b/qa/L0_backend_python/restart/restart_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,32 +27,34 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../../common") +import unittest + +import numpy as np import test_util as tu import tritonclient.http as httpclient from tritonclient.utils import * -import numpy as np -import unittest class RestartTest(tu.TestResultCollector): - def _infer_helper(self, model_name, shape, data_type): with httpclient.InferenceServerClient("localhost:8000") as client: input_data_0 = np.array(np.random.randn(*shape), dtype=data_type) inputs = [ - httpclient.InferInput("INPUT0", shape, - np_to_triton_dtype(input_data_0.dtype)) + httpclient.InferInput( + "INPUT0", shape, np_to_triton_dtype(input_data_0.dtype) + ) ] inputs[0].set_data_from_numpy(input_data_0) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertTrue(np.all(input_data_0 == output0)) def test_restart(self): shape = [1, 16] - model_name = 'restart' + model_name = "restart" dtype = np.float32 # Since the stub process has been killed, the first request @@ -64,10 +68,10 @@ def test_restart(self): def test_infer(self): shape = [1, 16] - model_name = 'restart' + model_name = "restart" dtype = np.float32 self._infer_helper(model_name, shape, dtype) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/restart/test.sh b/qa/L0_backend_python/restart/test.sh old mode 100644 new mode 100755 index 64c80332ac..f016af54c3 --- a/qa/L0_backend_python/restart/test.sh +++ b/qa/L0_backend_python/restart/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,13 +25,13 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -CLIENT_LOG="./client.log" +CLIENT_LOG="./restart_client.log" EXPECTED_NUM_TESTS="7" TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./restart_server.log" source ../../common/util.sh source ../common.sh @@ -127,4 +127,3 @@ else fi exit $RET - diff --git a/qa/L0_backend_python/setup_python_enviroment.sh b/qa/L0_backend_python/setup_python_enviroment.sh new file mode 100755 index 0000000000..90d0f6eaf2 --- /dev/null +++ b/qa/L0_backend_python/setup_python_enviroment.sh @@ -0,0 +1,171 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
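Both restart test cases drive the server through the same identity-inference helper shown above; as a self-contained snippet (model name, shape, and I/O names taken from the test), it amounts to:

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype


def infer_identity(model_name="restart", shape=(1, 16), dtype=np.float32):
    # Send random data to the identity model and check it comes back unchanged.
    with httpclient.InferenceServerClient("localhost:8000") as client:
        data = np.array(np.random.randn(*shape), dtype=dtype)
        inputs = [
            httpclient.InferInput("INPUT0", list(shape), np_to_triton_dtype(data.dtype))
        ]
        inputs[0].set_data_from_numpy(data)
        result = client.infer(model_name, inputs)
        assert np.all(result.as_numpy("OUTPUT0") == data)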
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +RET=0 +set -e +if [ ${PYTHON_ENV_VERSION} = "10" ]; then + echo No need to set up anything for default python3.${PYTHON_ENV_VERSION} + exit $RET +fi + +source common.sh +source ../common/util.sh + +SERVER=/opt/tritonserver/bin/tritonserver +BASE_SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1 --disable-auto-complete-config" +PYTHON_BACKEND_BRANCH=$PYTHON_BACKEND_REPO_TAG +SERVER_ARGS=$BASE_SERVER_ARGS +SERVER_LOG="./inference_server.log" +export PYTHON_ENV_VERSION=${PYTHON_ENV_VERSION:="10"} +RET=0 +EXPECTED_VERSION_STRINGS="" + +rm -fr ./models +rm -rf *.tar.gz +install_build_deps +install_conda + +# Test other python versions +conda update -n base -c defaults conda -y +# Create a model with python 3.8 version +# Successful execution of the Python model indicates that the environment has +# been setup correctly. +if [ ${PYTHON_ENV_VERSION} = "8" ]; then + create_conda_env "3.8" "python-3-8" + conda install -c conda-forge libstdcxx-ng=12 -y + conda install numpy=1.23.4 -y + conda install tensorflow=2.10.0 -y + EXPECTED_VERSION_STRING="Python version is 3.8, NumPy version is 1.23.4, and Tensorflow version is 2.10.0" + create_python_backend_stub + conda-pack -o python3.8.tar.gz + path_to_conda_pack="$PWD/python-3-8" + mkdir -p $path_to_conda_pack + tar -xzf python3.8.tar.gz -C $path_to_conda_pack + mkdir -p models/python_3_8/1/ + cp ../python_models/python_version/config.pbtxt ./models/python_3_8 + (cd models/python_3_8 && \ + sed -i "s/^name:.*/name: \"python_3_8\"/" config.pbtxt && \ + echo "parameters: {key: \"EXECUTION_ENV_PATH\", value: {string_value: \"$path_to_conda_pack\"}}">> config.pbtxt) + cp ../python_models/python_version/model.py ./models/python_3_8/1/ + cp python_backend/builddir/triton_python_backend_stub ./models/python_3_8 +fi + +# Create a model with python 3.9 version +# Successful execution of the Python model indicates that the environment has +# been setup correctly. 
+if [ ${PYTHON_ENV_VERSION} = "9" ]; then + create_conda_env "3.9" "python-3-9" + conda install -c conda-forge libstdcxx-ng=12 -y + conda install numpy=1.23.4 -y + conda install tensorflow=2.10.0 -y + EXPECTED_VERSION_STRING="Python version is 3.9, NumPy version is 1.23.4, and Tensorflow version is 2.10.0" + create_python_backend_stub + conda-pack -o python3.9.tar.gz + path_to_conda_pack="$PWD/python-3-9" + mkdir -p $path_to_conda_pack + tar -xzf python3.9.tar.gz -C $path_to_conda_pack + mkdir -p models/python_3_9/1/ + cp ../python_models/python_version/config.pbtxt ./models/python_3_9 + (cd models/python_3_9 && \ + sed -i "s/^name:.*/name: \"python_3_9\"/" config.pbtxt && \ + echo "parameters: {key: \"EXECUTION_ENV_PATH\", value: {string_value: \"$path_to_conda_pack\"}}">> config.pbtxt) + cp ../python_models/python_version/model.py ./models/python_3_9/1/ + cp python_backend/builddir/triton_python_backend_stub ./models/python_3_9 +fi + +# Create a model with python 3.11 version +# Successful execution of the Python model indicates that the environment has +# been setup correctly. +if [ ${PYTHON_ENV_VERSION} = "11" ]; then + create_conda_env "3.11" "python-3-11" + # tensorflow needs to be installed before numpy so pip does not mess up conda + # environment + pip install tensorflow==2.12.0 + conda install -c conda-forge libstdcxx-ng=12 -y + conda install numpy=1.23.5 -y + EXPECTED_VERSION_STRING="Python version is 3.11, NumPy version is 1.23.5, and Tensorflow version is 2.12.0" + create_python_backend_stub + conda-pack -o python3.11.tar.gz + path_to_conda_pack="$PWD/python-3-11" + mkdir -p $path_to_conda_pack + tar -xzf python3.11.tar.gz -C $path_to_conda_pack + mkdir -p models/python_3_11/1/ + cp ../python_models/python_version/config.pbtxt ./models/python_3_11 + (cd models/python_3_11 && \ + sed -i "s/^name:.*/name: \"python_3_11\"/" config.pbtxt && \ + echo "parameters: {key: \"EXECUTION_ENV_PATH\", value: {string_value: \"$path_to_conda_pack\"}}">> config.pbtxt) + cp ../python_models/python_version/model.py ./models/python_3_11/1/ + cp python_backend/builddir/triton_python_backend_stub ./models/python_3_11 +fi +conda deactivate +rm -rf ./miniconda + +# test that +set +e +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +grep "$EXPECTED_VERSION_STRING" $SERVER_LOG +if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** $EXPECTED_VERSION_STRING was not found in Triton logs. 
\n***" + RET=1 +fi +set -e + +echo "python environment 3.${PYTHON_ENV_VERSION}" +# copy the stub out to /opt/tritonserver/backends/python/triton_python_backend_stub +cp python_backend/builddir/triton_python_backend_stub /opt/tritonserver/backends/python/triton_python_backend_stub +# Set up environment and stub for each test +add-apt-repository ppa:deadsnakes/ppa -y +apt-get update && apt-get -y install \ + "python3.${PYTHON_ENV_VERSION}-dev" \ + "python3.${PYTHON_ENV_VERSION}-distutils" \ + libboost-dev +rm -f /usr/bin/python3 && \ +ln -s "/usr/bin/python3.${PYTHON_ENV_VERSION}" /usr/bin/python3 +pip3 install --upgrade install requests numpy virtualenv protobuf +find /opt/tritonserver/qa/pkgs/ -maxdepth 1 -type f -name \ + "tritonclient-*linux*.whl" | xargs printf -- '%s[all]' | \ + xargs pip3 install --upgrade + +# Build triton-shm-monitor for the test +cd python_backend && rm -rf install build && mkdir build && cd build && \ + cmake -DCMAKE_INSTALL_PREFIX:PATH=$PWD/install \ + -DTRITON_COMMON_REPO_TAG:STRING=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG:STRING=${TRITON_CORE_REPO_TAG} \ + -DTRITON_BACKEND_REPO_TAG:STRING=${TRITON_BACKEND_REPO_TAG} .. && \ + make -j16 triton-shm-monitor install +cp $PWD/install/backends/python/triton_shm_monitor.cpython-* /opt/tritonserver/qa/common/. +set +e +exit $RET diff --git a/qa/L0_backend_python/test.sh b/qa/L0_backend_python/test.sh index a4e11dfc9e..449cee8480 100755 --- a/qa/L0_backend_python/test.sh +++ b/qa/L0_backend_python/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -44,6 +44,7 @@ SERVER=${TRITON_DIR}/bin/tritonserver export BACKEND_DIR=${TRITON_DIR}/backends export TEST_JETSON=${TEST_JETSON:=0} export CUDA_VISIBLE_DEVICES=0 +export PYTHON_ENV_VERSION=${PYTHON_ENV_VERSION:="10"} BASE_SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" # Set the default byte size to 5MBs to avoid going out of shared memory. The @@ -53,7 +54,7 @@ SERVER_ARGS="$BASE_SERVER_ARGS --backend-config=python,shm-default-byte-size=524 PYTHON_BACKEND_BRANCH=$PYTHON_BACKEND_REPO_TAG CLIENT_PY=./python_test.py CLIENT_LOG="./client.log" -EXPECTED_NUM_TESTS="9" +EXPECTED_NUM_TESTS="11" TEST_RESULT_FILE='test_results.txt' SERVER_LOG="./inference_server.log" source ../common/util.sh @@ -61,6 +62,20 @@ source ./common.sh rm -fr *.log ./models +python3 --version | grep "3.10" > /dev/null +if [ $? -ne 0 ]; then + echo -e "Expecting Python default version to be: Python 3.10 but actual version is $(python3 --version)" + exit 1 +fi + +(bash -ex setup_python_enviroment.sh) + +python3 --version | grep "3.${PYTHON_ENV_VERSION}" > /dev/null +if [ $? -ne 0 ]; then + echo -e "Expecting Python version to be: Python 3.${PYTHON_ENV_VERSION} but actual version is $(python3 --version)" + exit 1 +fi + mkdir -p models/identity_fp32/1/ cp ../python_models/identity_fp32/model.py ./models/identity_fp32/1/model.py cp ../python_models/identity_fp32/config.pbtxt ./models/identity_fp32/config.pbtxt @@ -128,19 +143,23 @@ mkdir -p models/string_fixed/1/ cp ../python_models/string_fixed/model.py ./models/string_fixed/1/ cp ../python_models/string_fixed/config.pbtxt ./models/string_fixed -# Skip torch install on Jetson since it is already installed. 
+mkdir -p models/dlpack_identity/1/ +cp ../python_models/dlpack_identity/model.py ./models/dlpack_identity/1/ +cp ../python_models/dlpack_identity/config.pbtxt ./models/dlpack_identity + if [ "$TEST_JETSON" == "0" ]; then - pip3 install torch==1.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html + pip3 install torch==1.13.0+cpu -f https://download.pytorch.org/whl/torch_stable.html else - # test_growth_error is skipped on jetson - EXPECTED_NUM_TESTS=8 + pip3 install torch==1.13.0 -f https://download.pytorch.org/whl/torch_stable.html + # GPU tensor tests are disabled on jetson + EXPECTED_NUM_TESTS=9 fi prev_num_pages=`get_shm_pages` run_server if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG + echo -e "\n***\n*** Failed to start $SERVER\n***" exit 1 fi @@ -176,8 +195,8 @@ prev_num_pages=`get_shm_pages` # Triton non-graceful exit run_server if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG + echo -e "\n***\n*** Failed to start $SERVER\n***" exit 1 fi @@ -216,8 +235,8 @@ if [ "$TEST_JETSON" == "0" ]; then prev_num_pages=`get_shm_pages` run_server if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG + echo -e "\n***\n*** Failed to start $SERVER\n***" exit 1 fi @@ -252,8 +271,8 @@ cp ../python_models/identity_fp32/config.pbtxt ./models/multi_file/ prev_num_pages=`get_shm_pages` run_server if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG + echo -e "\n***\n*** Failed to start $SERVER\n***" exit 1 fi @@ -286,9 +305,9 @@ export MY_ENV="MY_ENV" prev_num_pages=`get_shm_pages` run_server if [ "$SERVER_PID" == "0" ]; then + cat $SERVER_LOG echo -e "\n***\n*** Failed to start $SERVER\n***" echo -e "\n***\n*** Environment variable test failed \n***" - cat $SERVER_LOG exit 1 fi @@ -315,8 +334,8 @@ SERVER_ARGS="$BASE_SERVER_ARGS --backend-config=python,shm-default-byte-size=$sh run_server if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG + echo -e "\n***\n*** Failed to start $SERVER\n***" exit 1 fi @@ -336,77 +355,95 @@ done kill $SERVER_PID wait $SERVER_PID -# Disable env test for Jetson since build is non-dockerized and cloud storage repos are not supported -# Disable ensemble, unittest, io and bls tests for Jetson since GPU Tensors are not supported -# Disable variants test for Jetson since already built without GPU Tensor support -# Disable decoupled test because it uses GPU tensors -if [ "$TEST_JETSON" == "0" ]; then - (cd env && bash -ex test.sh) - if [ $? -ne 0 ]; then - RET=1 - fi - - (cd ensemble && bash -ex test.sh) - if [ $? -ne 0 ]; then - RET=1 - fi - - (cd unittest && bash -ex test.sh) - if [ $? -ne 0 ]; then - RET=1 - fi - - (cd io && bash -ex test.sh) - if [ $? -ne 0 ]; then - RET=1 - fi - - (cd bls && bash -ex test.sh) - if [ $? -ne 0 ]; then - RET=1 - fi +# Test model getting killed during initialization +rm -fr ./models +mkdir -p models/init_exit/1/ +cp ../python_models/init_exit/model.py ./models/init_exit/1/model.py +cp ../python_models/init_exit/config.pbtxt ./models/init_exit/config.pbtxt - (cd decoupled && bash -ex test.sh) - if [ $? -ne 0 ]; then - RET=1 - fi +ERROR_MESSAGE="Stub process 'init_exit_0_0' is not healthy." - (cd variants && bash -ex test.sh) - if [ $? 
-ne 0 ]; then +prev_num_pages=`get_shm_pages` +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "*** FAILED: unexpected success starting $SERVER" >> $CLIENT_LOG RET=1 - fi + kill $SERVER_PID + wait $SERVER_PID +else + if grep "$ERROR_MESSAGE" $SERVER_LOG; then + echo -e "Found \"$ERROR_MESSAGE\"" >> $CLIENT_LOG + else + echo $CLIENT_LOG + echo -e "Not found \"$ERROR_MESSAGE\"" >> $CLIENT_LOG + RET=1 + fi fi -(cd lifecycle && bash -ex test.sh) -if [ $? -ne 0 ]; then - RET=1 +current_num_pages=`get_shm_pages` +if [ $current_num_pages -ne $prev_num_pages ]; then + cat $SERVER_LOG + ls /dev/shm + echo -e "\n***\n*** Test Failed. Shared memory pages where not cleaned properly. +Shared memory pages before starting triton equals to $prev_num_pages +and shared memory pages after starting triton equals to $current_num_pages \n***" + exit 1 fi -(cd restart && bash -ex test.sh) -if [ $? -ne 0 ]; then - RET=1 +# Disable env test for Jetson since cloud storage repos are not supported +# Disable ensemble, io and bls tests for Jetson since GPU Tensors are not supported +# Disable variants test for Jetson since already built without GPU Tensor support +# Disable decoupled test because it uses GPU tensors +if [ "$TEST_JETSON" == "0" ]; then + SUBTESTS="ensemble io bls decoupled variants python_based_backends" + for TEST in ${SUBTESTS}; do + # Run each subtest in a separate virtual environment to avoid conflicts + # between dependencies. + virtualenv --system-site-packages venv + source venv/bin/activate + + (cd ${TEST} && bash -ex test.sh) + if [ $? -ne 0 ]; then + echo "Subtest ${TEST} FAILED" + RET=1 + fi + + deactivate + rm -fr venv + done + + if [ ${PYTHON_ENV_VERSION} = "10" ]; then + # In 'env' test we use miniconda for dependency management. No need to run + # the test in a virtual environment. + (cd env && bash -ex test.sh) + if [ $? -ne 0 ]; then + echo "Subtest env FAILED" + RET=1 + fi + fi fi -(cd model_control && bash -ex test.sh) -if [ $? -ne 0 ]; then - RET=1 -fi +SUBTESTS="lifecycle restart model_control examples argument_validation logging custom_metrics request_rescheduling" +for TEST in ${SUBTESTS}; do + # Run each subtest in a separate virtual environment to avoid conflicts + # between dependencies. + virtualenv --system-site-packages venv + source venv/bin/activate -(cd examples && bash -ex test.sh) -if [ $? -ne 0 ]; then - RET=1 -fi + (cd ${TEST} && bash -ex test.sh) -(cd argument_validation && bash -ex test.sh) -if [ $? -ne 0 ]; then - RET=1 -fi + if [ $? -ne 0 ]; then + echo "Subtest ${TEST} FAILED" + RET=1 + fi + deactivate + rm -fr venv +done if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else - cat $SERVER_LOG echo -e "\n***\n*** Test FAILED\n***" fi diff --git a/qa/L0_backend_python/variants/test.sh b/qa/L0_backend_python/variants/test.sh old mode 100644 new mode 100755 index 24ceb1cf4c..65116cb2dc --- a/qa/L0_backend_python/variants/test.sh +++ b/qa/L0_backend_python/variants/test.sh @@ -25,7 +25,7 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
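Each subtest above now runs inside its own --system-site-packages virtualenv so that per-test pip installs cannot leak between subtests. A rough Python rendering of that shell loop, for illustration only (the patch itself uses virtualenv plus source venv/bin/activate; stdlib venv stands in here):

import os
import shutil
import subprocess
import venv
from pathlib import Path


def run_subtest_in_venv(subtest_dir):
    # Build a throwaway environment that still sees system site-packages,
    # run the subtest's test.sh with it first on PATH, then clean up.
    env_dir = Path("venv").resolve()
    venv.EnvBuilder(system_site_packages=True, with_pip=True).create(env_dir)
    child_env = dict(os.environ, PATH=f"{env_dir}/bin:{os.environ['PATH']}")
    try:
        proc = subprocess.run(
            ["bash", "-ex", "test.sh"], cwd=subtest_dir, env=child_env
        )
        return proc.returncode
    finally:
        shutil.rmtree(env_dir, ignore_errors=True)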
-# Buidling a CPU build of Python backend +# Building a CPU build of Python backend source ../common.sh install_build_deps diff --git a/qa/L0_backend_tutorial/test.sh b/qa/L0_backend_tutorial/test.sh index c745ea7ed2..4706c2c2dd 100755 --- a/qa/L0_backend_tutorial/test.sh +++ b/qa/L0_backend_tutorial/test.sh @@ -40,13 +40,14 @@ source ../common/util.sh RET=0 # Client build requires recent version of CMake (FetchContent required) -wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \ - gpg --dearmor - | \ - tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \ - apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' && \ - apt-get update && \ - apt-get install -y --no-install-recommends \ - cmake-data=3.21.1-0kitware1ubuntu20.04.1 cmake=3.21.1-0kitware1ubuntu20.04.1 \ +# Using CMAKE installation instruction from:: https://apt.kitware.com/ +apt update -q=2 \ + && apt install -y gpg wget \ + && wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null \ + && . /etc/os-release \ + && echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null \ + && apt-get update -q=2 \ + && apt-get install -y --no-install-recommends cmake=3.27.7* cmake-data=3.27.7* \ rapidjson-dev cmake --version @@ -186,8 +187,16 @@ if [ $? -ne 0 ]; then RET=1 fi +FOUND_MATCH=0 grep "batched INPUT value: \[ 1.000000, 1.100000, 1.200000, 1.300000, 2.000000, 2.100000, 2.200000, 2.300000, 3.000000, 3.100000, 3.200000, 3.300000, 4.000000, 4.100000, 4.200000, 4.300000, 10.000000, 10.100000, 10.200000, 10.300000, 20.000000, 20.100000, 20.200001, 20.299999, 30.000000, 30.100000, 30.200001, 30.299999, 40.000000, 40.099998, 40.200001, 40.299999 \]" $SERVER_LOG if [ $? -ne 0 ]; then + FOUND_MATCH=1 +fi +grep "batched INPUT value: \[ 10.000000, 10.100000, 10.200000, 10.300000, 20.000000, 20.100000, 20.200001, 20.299999, 30.000000, 30.100000, 30.200001, 30.299999, 40.000000, 40.099998, 40.200001, 40.299999, 1.000000, 1.100000, 1.200000, 1.300000, 2.000000, 2.100000, 2.200000, 2.300000, 3.000000, 3.100000, 3.200000, 3.300000, 4.000000, 4.100000, 4.200000, 4.300000 \]" $SERVER_LOG +if [ $? -ne 0 ]; then + FOUND_MATCH=1 +fi +if [ $FOUND_MATCH -eq 0 ]; then echo -e "\n***\n*** Failed to verify recommended server log. \n***" cat $SERVER_LOG cat $RECOMMENDED_LOG diff --git a/qa/L0_batch_custom/batch_custom_test.py b/qa/L0_batch_custom/batch_custom_test.py new file mode 100755 index 0000000000..6cd6346ad3 --- /dev/null +++ b/qa/L0_batch_custom/batch_custom_test.py @@ -0,0 +1,273 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
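The paired greps above accept either interleaving of the two batched requests, since the recommended backend may assemble the batch in either order. Expressed directly in Python, the intent is "pass if any one acceptable log line is present":

def recommended_batch_logged(server_log_path, expected_lines):
    # Return True if the server log contains at least one of the acceptable
    # "batched INPUT value: [...]" lines.
    with open(server_log_path) as f:
        log = f.read()
    return any(line in log for line in expected_lines)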
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import os +import threading +import time +import unittest +from builtins import range +from collections.abc import Iterable + +import infer_util as iu +import numpy as np +import test_util as tu +import tritonclient.grpc as grpcclient + +# By default, find tritonserver on "localhost", but can be overridden +# with TRITONSERVER_IPADDR envvar +_tritonserver_ipaddr = os.environ.get("TRITONSERVER_IPADDR", "localhost") + +_deferred_exceptions_lock = threading.Lock() +_deferred_exceptions = [] + + +class BatcherTest(tu.TestResultCollector): + def setUp(self): + # The helper client for setup will be GRPC for simplicity. + self.triton_client_ = grpcclient.InferenceServerClient( + f"{_tritonserver_ipaddr}:8001" + ) + self.precreated_shm_regions_ = [] + global _deferred_exceptions + _deferred_exceptions = [] + + def tearDown(self): + super().tearDown() + + def add_deferred_exception(self, ex): + global _deferred_exceptions + with _deferred_exceptions_lock: + _deferred_exceptions.append(ex) + + def check_deferred_exception(self): + # Just raise one of the exceptions... + with _deferred_exceptions_lock: + if len(_deferred_exceptions) > 0: + raise _deferred_exceptions[0] + + def check_response( + self, + trial, + bs, + thresholds, + requested_outputs=("OUTPUT0", "OUTPUT1"), + input_size=16, + shm_region_names=None, + precreated_shm_regions=None, + ): + try: + start_ms = int(round(time.time() * 1000)) + + if ( + trial == "savedmodel" + or trial == "graphdef" + or trial == "libtorch" + or trial == "onnx" + or trial == "plan" + or trial == "python" + ): + tensor_shape = (bs, input_size) + iu.infer_exact( + self, + trial, + tensor_shape, + bs, + np.float32, + np.float32, + np.float32, + swap=False, + model_version=1, + outputs=requested_outputs, + use_http=False, + use_grpc=False, + use_http_json_tensors=False, + skip_request_id_check=True, + use_streaming=False, + ) + else: + self.assertFalse(True, "unknown trial type: " + trial) + + end_ms = int(round(time.time() * 1000)) + + lt_ms = thresholds[0] + gt_ms = thresholds[1] + if lt_ms is not None: + self.assertTrue( + (end_ms - start_ms) < lt_ms, + "expected less than " + + str(lt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) + if gt_ms is not None: + self.assertTrue( + (end_ms - start_ms) > gt_ms, + "expected greater than " + + str(gt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) + except Exception as ex: + self.add_deferred_exception(ex) + + def check_status(self, model_name, batch_exec, request_cnt, infer_cnt, exec_count): + # There is a time window between when responses are returned and statistics are updated. 
+ # To prevent intermittent test failure during that window, wait up to 10 seconds for the + # inference statistics to be ready. + num_tries = 10 + for i in range(num_tries): + stats = self.triton_client_.get_inference_statistics(model_name, "1") + self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") + actual_exec_cnt = stats.model_stats[0].execution_count + if actual_exec_cnt == exec_count: + break + print( + "WARNING: expect {} executions, got {} (attempt {})".format( + exec_count, actual_exec_cnt, i + ) + ) + time.sleep(1) + + self.assertEqual( + stats.model_stats[0].name, + model_name, + "expect model stats for model {}".format(model_name), + ) + self.assertEqual( + stats.model_stats[0].version, + "1", + "expect model stats for model {} version 1".format(model_name), + ) + + if batch_exec: + batch_stats = stats.model_stats[0].batch_stats + self.assertEqual( + len(batch_stats), + len(batch_exec), + "expected {} different batch-sizes, got {}".format( + len(batch_exec), len(batch_stats) + ), + ) + + for batch_stat in batch_stats: + bs = batch_stat.batch_size + bc = batch_stat.compute_infer.count + self.assertTrue(bs in batch_exec, "unexpected batch-size {}".format(bs)) + # Get count from one of the stats + self.assertEqual( + bc, + batch_exec[bs], + "expected model-execution-count {} for batch size {}, got {}".format( + batch_exec[bs], bs, bc + ), + ) + + actual_request_cnt = stats.model_stats[0].inference_stats.success.count + self.assertEqual( + actual_request_cnt, + request_cnt, + "expected model-request-count {}, got {}".format( + request_cnt, actual_request_cnt + ), + ) + + actual_exec_cnt = stats.model_stats[0].execution_count + if isinstance(exec_count, Iterable): + self.assertIn( + actual_exec_cnt, + exec_count, + "expected model-exec-count {}, got {}".format( + exec_count, actual_exec_cnt + ), + ) + else: + self.assertEqual( + actual_exec_cnt, + exec_count, + "expected model-exec-count {}, got {}".format( + exec_count, actual_exec_cnt + ), + ) + actual_infer_cnt = stats.model_stats[0].inference_count + self.assertEqual( + actual_infer_cnt, + infer_cnt, + "expected model-inference-count {}, got {}".format( + infer_cnt, actual_infer_cnt + ), + ) + + def test_volume_batching(self): + # Send 12 requests with batch size 1. The max_queue_delay is set + # to non-zero. Depending upon the timing of the requests arrival + # there can be either 4-6 model executions. + model_base = "onnx" + dtype = np.float16 + shapes = ( + [ + 1, + 4, + 4, + ], + ) + + try: + # use threads to send 12 requests without waiting for response + threads = [] + for i in range(12): + threads.append( + threading.Thread( + target=iu.infer_zero, + args=(self, model_base, 1, dtype, shapes, shapes), + kwargs={ + "use_http": True, + "use_grpc": False, + "use_http_json_tensors": False, + "use_streaming": False, + }, + ) + ) + for t in threads: + t.start() + for t in threads: + t.join() + self.check_deferred_exception() + model_name = tu.get_zero_model_name(model_base, len(shapes), dtype) + self.check_status(model_name, None, 12, 12, (4, 5, 6)) + except Exception as ex: + self.assertTrue(False, "unexpected error {}".format(ex)) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_batch_custom/test.sh b/qa/L0_batch_custom/test.sh new file mode 100755 index 0000000000..01701df661 --- /dev/null +++ b/qa/L0_batch_custom/test.sh @@ -0,0 +1,192 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
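check_status above tolerates the window between responses being returned and statistics being updated by polling get_inference_statistics for up to 10 seconds. Reduced to its core, the polling loop looks like this (server address and version string taken from the test):

import time

import tritonclient.grpc as grpcclient


def wait_for_execution_count(model_name, expected, timeout_s=10):
    # Poll model statistics until the reported execution_count reaches the
    # expected value, mirroring BatcherTest.check_status above.
    client = grpcclient.InferenceServerClient("localhost:8001")
    for _ in range(timeout_s):
        stats = client.get_inference_statistics(model_name, "1")
        if stats.model_stats[0].execution_count == expected:
            return stats
        time.sleep(1)
    raise TimeoutError(f"{model_name}: execution_count never reached {expected}")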
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +## This test tests the ability to use custom batching strategies with models. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +export CUDA_VISIBLE_DEVICES=0 + +BATCH_CUSTOM_TEST=batch_custom_test.py +CLIENT_LOG_BASE="./client.log" +DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_identity_model_repository +EXPECTED_NUM_TESTS="1" +MODEL_NAME="onnx_zero_1_float16" +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_ARGS="--model-repository=models --log-verbose 1" +SERVER_LOG_BASE="./inference_server.log" +TEST_RESULT_FILE='test_results.txt' +TRITON_BACKEND_REPO_TAG=${TRITON_BACKEND_REPO_TAG:="main"} +TRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG:="main"} + +source ../common/util.sh +RET=0 + +# Batch strategy build requires recent version of CMake (FetchContent required) +# Using CMAKE installation instruction from:: https://apt.kitware.com/ +apt update -q=2 \ + && apt install -y gpg wget \ + && wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null \ + && . 
/etc/os-release \ + && echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null \ + && apt-get update -q=2 \ + && apt-get install -y --no-install-recommends cmake=3.27.7* cmake-data=3.27.7* rapidjson-dev +cmake --version + +# Set up repository +rm -fr *.log* ./backend +rm -fr models && mkdir models +cp -r $DATADIR/$MODEL_NAME models + +CONFIG_PATH="models/${MODEL_NAME}/config.pbtxt" +echo "dynamic_batching { max_queue_delay_microseconds: 10000}" >> ${CONFIG_PATH} +echo "instance_group [ { kind: KIND_GPU count: 2 }]" >> ${CONFIG_PATH} +echo "parameters { key: \"MAX_BATCH_VOLUME_BYTES\" value: {string_value: \"96\"}}" >> ${CONFIG_PATH} + +# Create custom batching libraries +git clone --single-branch --depth=1 -b $TRITON_BACKEND_REPO_TAG \ + https://github.com/triton-inference-server/backend.git + +(cd backend/examples/batching_strategies/volume_batching && + mkdir build && + cd build && + cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install \ + -DTRITON_CORE_REPO_TAG=$TRITON_CORE_REPO_TAG .. && + make -j4 install) + + (cd backend/examples/batching_strategies/single_batching && + mkdir build && + cd build && + cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install \ + -DTRITON_CORE_REPO_TAG=$TRITON_CORE_REPO_TAG .. && + make -j4 install) + +cp -r backend/examples/batching_strategies/volume_batching/build/libtriton_volumebatching.so models +cp -r backend/examples/batching_strategies/single_batching/build/libtriton_singlebatching.so models + +# Run a test to validate the single batching strategy example. +# Then, run tests to validate the volume batching example being passed in via the backend dir, model dir, version dir, and model config. +BACKEND_DIR="/opt/tritonserver/backends/onnxruntime" +MODEL_DIR="models/$MODEL_NAME" +VERSION_DIR="$MODEL_DIR/1/" + +test_types=('single_batching_backend' 'backend_directory' 'model_directory' 'version_directory' 'model_config') +test_setups=("cp models/libtriton_singlebatching.so ${BACKEND_DIR}/batchstrategy.so && sed -i \"s/(4, 5, 6))/(12))/\" ${BATCH_CUSTOM_TEST}" + "cp models/libtriton_volumebatching.so ${BACKEND_DIR}/batchstrategy.so && sed -i \"s/(12))/(4, 5, 6))/\" ${BATCH_CUSTOM_TEST}" + "mv ${BACKEND_DIR}/batchstrategy.so ${MODEL_DIR} && cp models/libtriton_singlebatching.so ${BACKEND_DIR}" + "mv ${MODEL_DIR}/batchstrategy.so ${VERSION_DIR}/batchstrategy.so" + "mv ${VERSION_DIR}/batchstrategy.so models/${MODEL_NAME}/libtriton_volumebatching.so && echo \"parameters: {key: \\\"TRITON_BATCH_STRATEGY_PATH\\\", value: {string_value: \\\"${MODEL_DIR}/libtriton_volumebatching.so\\\"}}\" >> ${CONFIG_PATH}") + +for i in "${!test_setups[@]}"; do + echo "Running ${test_types[$i]} test" + eval ${test_setups[$i]} + + SERVER_LOG=${SERVER_LOG_BASE}_${test_types[$i]} + CLIENT_LOG=${CLIENT_LOG_BASE}_${test_types[$i]} + + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + if [ `grep -c "Loading custom batching strategy" $SERVER_LOG` != "1" ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Failed to load custom batching strategy.***" + RET=1 + else + set +e + python $BATCH_CUSTOM_TEST >$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** ${test_types[$i]} Test Failed\n***" + RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? 
-ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** ${test_types[$i]} Test Result Verification Failed\n***" + RET=1 + fi + fi + set -e + fi + + kill $SERVER_PID + wait $SERVER_PID +done + +# Test ModelBatchInitialize failure +FILE_PATH="backend/examples/batching_strategies/volume_batching/src/volume_batching.cc" +OLD_STRING="\/\/ Batcher will point to an unsigned integer representing the maximum" +NEW_STRING="return TRITONSERVER_ErrorNew(TRITONSERVER_ERROR_NOT_FOUND,\"Failure test case\");" + +sed -i "s/${OLD_STRING}/${NEW_STRING}/g" ${FILE_PATH} + +(cd backend/examples/batching_strategies/volume_batching && + cd build && + cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install \ + -DTRITON_CORE_REPO_TAG=$TRITON_CORE_REPO_TAG .. && + make -j4 install) + +cp -r backend/examples/batching_strategies/volume_batching/build/libtriton_volumebatching.so models/${MODEL_NAME}/libtriton_volumebatching.so + +SERVER_LOG=${SERVER_LOG_BASE}_batching_init_failure + +run_server +if [ "$SERVER_PID" != "0" ]; then + cat $SERVER_LOG + echo -e "\n***\n*** ModelBatchInit Error Test: unexpected successful server start $SERVER\n***" + kill_server + RET=1 +else + if [ `grep -c "Failure test case" $SERVER_LOG` -lt 1 ] || [ `grep -c "Not found" $SERVER_LOG` -lt 1 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** ModelBatchInit Error Test: failed to find \"Failure test case\" message and/or \"Not found\" error type" + RET=1 + fi +fi + + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_batch_input/batch_input_test.py b/qa/L0_batch_input/batch_input_test.py old mode 100644 new mode 100755 index 2931dadbad..02de27d921 --- a/qa/L0_batch_input/batch_input_test.py +++ b/qa/L0_batch_input/batch_input_test.py @@ -1,4 +1,6 @@ -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,52 +27,68 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
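The 4-6 executions expected by test_volume_batching follow from the byte budget configured above, assuming the volume batcher counts only the raw input payload: each request carries an FP16 tensor of shape [1, 4, 4], i.e. 16 elements * 2 bytes = 32 bytes, so MAX_BATCH_VOLUME_BYTES=96 admits about three requests per batch, and 12 single-request inferences collapse into roughly 12 / 3 = 4 executions, with 5 or 6 possible when arrival timing splits a batch:

import numpy as np

bytes_per_request = int(np.prod([1, 4, 4])) * np.dtype(np.float16).itemsize  # 32
requests_per_batch = 96 // bytes_per_request                                 # 3
min_executions = 12 // requests_per_batch                                    # 4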
import sys + sys.path.append("../common") +import queue import unittest +from functools import partial + import numpy as np -import infer_util as iu import test_util as tu -import tritonhttpclient -from tritonclientutils import InferenceServerException +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException class BatchInputTest(tu.TestResultCollector): - def setUp(self): + self.client = grpcclient.InferenceServerClient(url="localhost:8001") + + def callback(user_data, result, error): + if error: + user_data.put(error) + else: + user_data.put(result) + + self.client_callback = callback + + def set_inputs(self, shapes, input_name): self.dtype_ = np.float32 self.inputs = [] - # 4 set of inputs with shape [2], [4], [1], [3] - for value in [2, 4, 1, 3]: - self.inputs.append([ - tritonhttpclient.InferInput('RAGGED_INPUT', [1, value], "FP32") - ]) + for shape in shapes: + self.inputs.append( + [grpcclient.InferInput(input_name, [1, shape[0]], "FP32")] + ) self.inputs[-1][0].set_data_from_numpy( - np.full([1, value], value, np.float32)) - self.client = tritonhttpclient.InferenceServerClient( - url="localhost:8000", concurrency=len(self.inputs)) + np.full([1, shape[0]], shape[0], np.float32) + ) + + def set_inputs_for_batch_item(self, shapes, input_name): + self.dtype_ = np.float32 + self.inputs = [] + for shape in shapes: + self.inputs.append([grpcclient.InferInput(input_name, shape, "FP32")]) + self.inputs[-1][0].set_data_from_numpy(np.full(shape, shape[0], np.float32)) def test_ragged_output(self): model_name = "ragged_io" + # The model is an identity model + self.set_inputs([[2], [4], [1], [3]], "INPUT0") + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - # The model is identity model - self.inputs = [] - for value in [2, 4, 1, 3]: - self.inputs.append( - [tritonhttpclient.InferInput('INPUT0', [1, value], "FP32")]) - self.inputs[-1][0].set_data_from_numpy( - np.full([1, value], value, np.float32)) - output_name = 'OUTPUT0' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "OUTPUT0" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) expected_value_list = [[v] * v for v in [2, 4, 1, 3]] expected_value_list = [ @@ -80,31 +98,37 @@ def test_ragged_output(self): for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. 
output_data = result.as_numpy(output_name) self.assertTrue( np.array_equal(output_data, expected_value_list[idx]), "Expect response {} to have value {}, got {}".format( - idx, expected_value_list[idx], output_data)) + idx, expected_value_list[idx], output_data + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_ragged_input(self): model_name = "ragged_acc_shape" + self.set_inputs([[2], [4], [1], [3]], "RAGGED_INPUT") + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'RAGGED_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] - + output_name = "RAGGED_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) value_lists = [[v] * v for v in [2, 4, 1, 3]] expected_value = [] @@ -114,191 +138,218 @@ def test_ragged_input(self): for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() - + result = user_data.get() # Validate the results by comparing with precomputed values. output_data = result.as_numpy(output_name) self.assertTrue( np.array_equal(output_data, expected_value), "Expect response {} to have value {}, got {}".format( - idx, expected_value, output_data)) + idx, expected_value, output_data + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_element_count(self): model_name = "ragged_element_count_acc_zero" + self.set_inputs([[2], [4], [1], [3]], "RAGGED_INPUT") + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'BATCH_AND_SIZE_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "BATCH_AND_SIZE_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) expected_value = np.asarray([[2, 4, 1, 3]], np.float32) for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. 
output_data = result.as_numpy(output_name) self.assertTrue( np.array_equal(output_data, expected_value), "Expect response {} to have value {}, got {}".format( - idx, expected_value, output_data)) + idx, expected_value, output_data + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_accumulated_element_count(self): model_name = "ragged_acc_shape" + self.set_inputs([[2], [4], [1], [3]], "RAGGED_INPUT") + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'BATCH_AND_SIZE_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "BATCH_AND_SIZE_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) expected_value = np.asarray([[2, 6, 7, 10]], np.float32) for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. output_data = result.as_numpy(output_name) self.assertTrue( np.array_equal(output_data, expected_value), "Expect response {} to have value {}, got {}".format( - idx, expected_value, output_data)) + idx, expected_value, output_data + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_accumulated_element_count_with_zero(self): model_name = "ragged_element_count_acc_zero" + self.set_inputs([[2], [4], [1], [3]], "RAGGED_INPUT") + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'BATCH_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "BATCH_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) expected_value = np.asarray([[0, 2, 6, 7, 10]], np.float32) for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. 
output_data = result.as_numpy(output_name) self.assertTrue( np.array_equal(output_data, expected_value), "Expect response {} to have value {}, got {}".format( - idx, expected_value, output_data)) + idx, expected_value, output_data + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_max_element_count_as_shape(self): model_name = "ragged_acc_shape" + self.set_inputs([[2], [4], [1], [3]], "RAGGED_INPUT") + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'BATCH_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "BATCH_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. output_data = result.as_numpy(output_name) self.assertEqual( - output_data.shape, (1, 4), - "Expect response {} to have shape to represent max element count {} among the batch , got {}" - .format(idx, 4, output_data.shape)) + output_data.shape, + (1, 4), + "Expect response {} to have shape to represent max element count {} among the batch , got {}".format( + idx, 4, output_data.shape + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_batch_item_shape_flatten(self): # Use 4 set of inputs with shape # [1, 4, 1], [1, 1, 2], [1, 1, 2], [1, 2, 2] # Note that the test only checks the formation of "BATCH_INPUT" where # the value of "RAGGED_INPUT" is irrelevant, only the shape matters - self.inputs = [] - for value in [[1, 4, 1], [1, 1, 2], [1, 1, 2], [1, 2, 2]]: - self.inputs.append( - [tritonhttpclient.InferInput('RAGGED_INPUT', value, "FP32")]) - self.inputs[-1][0].set_data_from_numpy( - np.full(value, value[0], np.float32)) - self.client = tritonhttpclient.InferenceServerClient( - url="localhost:8000", concurrency=len(self.inputs)) + self.set_inputs_for_batch_item( + [[1, 4, 1], [1, 1, 2], [1, 1, 2], [1, 2, 2]], "RAGGED_INPUT" + ) model_name = "batch_item_flatten" + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'BATCH_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "BATCH_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) expected_value = np.asarray([[4, 1, 1, 2, 1, 2, 2, 2]], np.float32) for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. 
- result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. output_data = result.as_numpy(output_name) self.assertTrue( np.array_equal(output_data, expected_value), "Expect response {} to have value {}, got {}".format( - idx, expected_value, output_data)) + idx, expected_value, output_data + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_batch_item_shape(self): # Use 3 set of inputs with shape [2, 1, 2], [1, 1, 2], [1, 2, 2] # Note that the test only checks the formation of "BATCH_INPUT" where # the value of "RAGGED_INPUT" is irrelevant, only the shape matters - inputs = [] - for value in [[2, 1, 2], [1, 1, 2], [1, 2, 2]]: - inputs.append( - [tritonhttpclient.InferInput('RAGGED_INPUT', value, "FP32")]) - inputs[-1][0].set_data_from_numpy( - np.full(value, value[0], np.float32)) - client = tritonhttpclient.InferenceServerClient(url="localhost:8000", - concurrency=len(inputs)) + self.set_inputs_for_batch_item( + [[2, 1, 2], [1, 1, 2], [1, 2, 2]], "RAGGED_INPUT" + ) expected_outputs = [ np.array([[1.0, 2.0], [1.0, 2.0]]), @@ -307,34 +358,41 @@ def test_batch_item_shape(self): ] model_name = "batch_item" + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'BATCH_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "BATCH_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for request_inputs in inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - client.async_infer(model_name=model_name, - inputs=request_inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. output_data = result.as_numpy(output_name) self.assertTrue( np.allclose(output_data, expected_outputs[idx]), - "Expect response to have value:\n{}, got:\n{}\nEqual matrix:\n{}" - .format(expected_outputs[idx], output_data, - np.isclose(expected_outputs[idx], output_data))) + "Expect response to have value:\n{}, got:\n{}\nEqual matrix:\n{}".format( + expected_outputs[idx], + output_data, + np.isclose(expected_outputs[idx], output_data), + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_batch_input/test.sh b/qa/L0_batch_input/test.sh old mode 100644 new mode 100755 index 56ca448f3a..e780516ec4 --- a/qa/L0_batch_input/test.sh +++ b/qa/L0_batch_input/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
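The earlier grpc_endpoint_test.py and the reworked batch_input_test.py above share the same streaming-client idiom: register a callback that pushes each result or error onto a queue, open one gRPC stream, issue async_stream_infer calls, and drain the queue. Condensed from batch_input_test above (the decoupled grpc_endpoint_test drains more than one response per request, but the callback/queue wiring is identical):

import queue
from functools import partial

import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException


def stream_infer(model_name, inputs_list, output_name):
    # Send each request over one gRPC stream and collect responses in order
    # of arrival.
    results = queue.Queue()

    def callback(q, result, error):
        q.put(error if error else result)

    with grpcclient.InferenceServerClient("localhost:8001") as client:
        client.start_stream(callback=partial(callback, results))
        for inputs in inputs_list:
            client.async_stream_infer(model_name=model_name, inputs=inputs)
        outputs = []
        for _ in inputs_list:
            item = results.get()
            if isinstance(item, InferenceServerException):
                raise item
            outputs.append(item.as_numpy(output_name))
        client.stop_stream()
    return outputs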
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -54,7 +54,7 @@ SERVER_LOG="./inference_server.log" source ../common/util.sh # If BACKENDS not specified, set to all -BACKENDS=${BACKENDS:="onnx savedmodel plan"} +BACKENDS=${BACKENDS:="onnx savedmodel plan libtorch"} rm -f $SERVER_LOG $CLIENT_LOG @@ -82,6 +82,8 @@ for BACKEND in $BACKENDS; do # batch input is generated properly cp -r $IDENTITY_DATADIR/${BACKEND}_nobatch_zero_1_float32 models/ragged_io (cd models/ragged_io && \ + # In case of libtorch, update I/O names + sed -i "s/__0/0/" config.pbtxt && \ sed -i "s/${BACKEND}_nobatch_zero_1_float32/ragged_io/" config.pbtxt && \ sed -i "s/^max_batch_size:.*/max_batch_size: 4/" config.pbtxt && \ sed -i "s/name: \"INPUT0\"/name: \"INPUT0\"\\nallow_ragged_batch: true/" config.pbtxt && \ @@ -99,7 +101,7 @@ for BACKEND in $BACKENDS; do fi set +e - python $BATCH_INPUT_TEST >$CLIENT_LOG 2>&1 + python3 $BATCH_INPUT_TEST >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" diff --git a/qa/L0_batcher/batcher_test.py b/qa/L0_batcher/batcher_test.py old mode 100644 new mode 100755 index 31382c5918..38e208c21e --- a/qa/L0_batcher/batcher_test.py +++ b/qa/L0_batcher/batcher_test.py @@ -1,4 +1,6 @@ -# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,27 +27,26 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -from builtins import range import os -import time import threading +import time import unittest -import numpy as np +from builtins import range + import infer_util as iu +import numpy as np import test_util as tu - import tritonclient.grpc as grpcclient # By default, find tritonserver on "localhost", but can be overridden # with TRITONSERVER_IPADDR envvar -_tritonserver_ipaddr = os.environ.get('TRITONSERVER_IPADDR', 'localhost') +_tritonserver_ipaddr = os.environ.get("TRITONSERVER_IPADDR", "localhost") -TEST_SYSTEM_SHARED_MEMORY = bool( - int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0))) -TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get('TEST_CUDA_SHARED_MEMORY', - 0))) +TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0))) +TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0))) if TEST_SYSTEM_SHARED_MEMORY: import tritonclient.utils.shared_memory as shm @@ -54,13 +55,13 @@ # Test with either GRPC of HTTP, but not both since when we check # results we expect only one to run -USE_GRPC = (os.environ.get('USE_GRPC', 1) != "0") -USE_HTTP = (os.environ.get('USE_HTTP', 1) != "0") +USE_GRPC = os.environ.get("USE_GRPC", 1) != "0" +USE_HTTP = os.environ.get("USE_HTTP", 1) != "0" if USE_GRPC and USE_HTTP: USE_GRPC = False assert USE_GRPC or USE_HTTP, "USE_GRPC or USE_HTTP must be non-zero" -BACKENDS = os.environ.get('BACKENDS', "graphdef savedmodel onnx libtorch plan") +BACKENDS = os.environ.get("BACKENDS", "graphdef savedmodel onnx libtorch plan python") _trials = BACKENDS.split(" ") @@ -69,6 +70,8 @@ _ragged_batch_supported_trials.append("plan") if "onnx" in _trials: _ragged_batch_supported_trials.append("onnx") +if "libtorch" in _trials: + 
_ragged_batch_supported_trials.append("libtorch") _max_queue_delay_ms = 10000 @@ -77,10 +80,11 @@ class BatcherTest(tu.TestResultCollector): - def setUp(self): # The helper client for setup will be GRPC for simplicity. - self.triton_client_ = grpcclient.InferenceServerClient(f"{_tritonserver_ipaddr}:8001") + self.triton_client_ = grpcclient.InferenceServerClient( + f"{_tritonserver_ipaddr}:8001" + ) self.precreated_shm_regions_ = [] global _deferred_exceptions _deferred_exceptions = [] @@ -102,19 +106,22 @@ def create_advance(self, shm_regions=None): if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: precreated_shm_regions = [] if shm_regions is None: - shm_regions = ['output0', 'output1'] + shm_regions = ["output0", "output1"] for shm_region in shm_regions: if TEST_SYSTEM_SHARED_MEMORY: shm_handle = shm.create_shared_memory_region( - shm_region + '_data', '/' + shm_region, 512) + shm_region + "_data", "/" + shm_region, 512 + ) self.triton_client_.register_system_shared_memory( - shm_region + '_data', '/' + shm_region, 512) + shm_region + "_data", "/" + shm_region, 512 + ) else: shm_handle = cudashm.create_shared_memory_region( - shm_region + '_data', 512, 0) + shm_region + "_data", 512, 0 + ) self.triton_client_.register_cuda_shared_memory( - shm_region + '_data', - cudashm.get_raw_handle(shm_handle), 0, 512) + shm_region + "_data", cudashm.get_raw_handle(shm_handle), 0, 512 + ) # Collect precreated handles for cleanup self.precreated_shm_regions_.append(shm_handle) precreated_shm_regions.append(shm_handle) @@ -132,19 +139,27 @@ def check_deferred_exception(self): if len(_deferred_exceptions) > 0: raise _deferred_exceptions[0] - def check_response(self, - trial, - bs, - thresholds, - requested_outputs=("OUTPUT0", "OUTPUT1"), - input_size=16, - shm_region_names=None, - precreated_shm_regions=None): + def check_response( + self, + trial, + bs, + thresholds, + requested_outputs=("OUTPUT0", "OUTPUT1"), + input_size=16, + shm_region_names=None, + precreated_shm_regions=None, + ): try: start_ms = int(round(time.time() * 1000)) - if trial == "savedmodel" or trial == "graphdef" or trial == "libtorch" \ - or trial == "onnx" or trial == "plan": + if ( + trial == "savedmodel" + or trial == "graphdef" + or trial == "libtorch" + or trial == "onnx" + or trial == "plan" + or trial == "python" + ): tensor_shape = (bs, input_size) iu.infer_exact( self, @@ -165,7 +180,8 @@ def check_response(self, shm_region_names=shm_region_names, precreated_shm_regions=precreated_shm_regions, use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, - use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY) + use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY, + ) else: self.assertFalse(True, "unknown trial type: " + trial) @@ -176,72 +192,109 @@ def check_response(self, if lt_ms is not None: self.assertTrue( (end_ms - start_ms) < lt_ms, - "expected less than " + str(lt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected less than " + + str(lt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) if gt_ms is not None: self.assertTrue( (end_ms - start_ms) > gt_ms, - "expected greater than " + str(gt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected greater than " + + str(gt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) except Exception as ex: self.add_deferred_exception(ex) - def check_setup(self, model_name, preferred_batch_sizes, - max_queue_delay_us): + def check_setup(self, model_name, preferred_batch_sizes, 
max_queue_delay_us): # Make sure test.sh set up the correct batcher settings config = self.triton_client_.get_model_config(model_name).config bconfig = config.dynamic_batching - self.assertEqual(len(bconfig.preferred_batch_size), - len(preferred_batch_sizes)) + self.assertEqual(len(bconfig.preferred_batch_size), len(preferred_batch_sizes)) for i in preferred_batch_sizes: self.assertTrue(i in bconfig.preferred_batch_size) - self.assertEqual(bconfig.max_queue_delay_microseconds, - max_queue_delay_us) + self.assertEqual(bconfig.max_queue_delay_microseconds, max_queue_delay_us) def check_status(self, model_name, batch_exec, request_cnt, infer_cnt, exec_count): - stats = self.triton_client_.get_inference_statistics(model_name, "1") - self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") - self.assertEqual(stats.model_stats[0].name, model_name, - "expect model stats for model {}".format(model_name)) + # There is a time window between when responses are returned and statistics are updated. + # To prevent intermittent test failure during that window, wait up to 10 seconds for the + # inference statistics to be ready. + num_tries = 10 + for i in range(num_tries): + stats = self.triton_client_.get_inference_statistics(model_name, "1") + self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") + actual_exec_cnt = stats.model_stats[0].execution_count + if actual_exec_cnt in exec_count: + break + print( + "WARNING: expect {} executions, got {} (attempt {})".format( + exec_count, actual_exec_cnt, i + ) + ) + time.sleep(1) + self.assertEqual( - stats.model_stats[0].version, "1", - "expect model stats for model {} version 1".format(model_name)) + stats.model_stats[0].name, + model_name, + "expect model stats for model {}".format(model_name), + ) + self.assertEqual( + stats.model_stats[0].version, + "1", + "expect model stats for model {} version 1".format(model_name), + ) if batch_exec: batch_stats = stats.model_stats[0].batch_stats self.assertEqual( - len(batch_stats), len(batch_exec), + len(batch_stats), + len(batch_exec), "expected {} different batch-sizes, got {}".format( - len(batch_exec), len(batch_stats))) + len(batch_exec), len(batch_stats) + ), + ) for batch_stat in batch_stats: bs = batch_stat.batch_size bc = batch_stat.compute_infer.count - self.assertTrue(bs in batch_exec, - "unexpected batch-size {}".format(bs)) + self.assertTrue(bs in batch_exec, "unexpected batch-size {}".format(bs)) # Get count from one of the stats self.assertEqual( - bc, batch_exec[bs], - "expected model-execution-count {} for batch size {}, got {}". 
- format(batch_exec[bs], bs, bc)) + bc, + batch_exec[bs], + "expected model-execution-count {} for batch size {}, got {}".format( + batch_exec[bs], bs, bc + ), + ) actual_request_cnt = stats.model_stats[0].inference_stats.success.count self.assertEqual( - actual_request_cnt, request_cnt, + actual_request_cnt, + request_cnt, "expected model-request-count {}, got {}".format( - request_cnt, actual_request_cnt)) + request_cnt, actual_request_cnt + ), + ) actual_exec_cnt = stats.model_stats[0].execution_count self.assertIn( - actual_exec_cnt, exec_count, - "expected model-exec-count {}, got {}".format( - request_cnt, actual_exec_cnt)) + actual_exec_cnt, + exec_count, + "expected model-exec-count {}, got {}".format(exec_count, actual_exec_cnt), + ) actual_infer_cnt = stats.model_stats[0].inference_count self.assertEqual( - actual_infer_cnt, infer_cnt, + actual_infer_cnt, + infer_cnt, "expected model-inference-count {}, got {}".format( - infer_cnt, actual_infer_cnt)) + infer_cnt, actual_infer_cnt + ), + ) def test_static_batch_preferred(self): # Send two requests with static batch sizes == preferred @@ -250,20 +303,25 @@ def test_static_batch_preferred(self): precreated_shm_regions = self.create_advance() for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) self.check_response( trial, - 2, (3000, None), - precreated_shm_regions=precreated_shm_regions) + 2, + (3000, None), + precreated_shm_regions=precreated_shm_regions, + ) self.check_response( trial, - 6, (3000, None), - precreated_shm_regions=precreated_shm_regions) + 6, + (3000, None), + precreated_shm_regions=precreated_shm_regions, + ) self.check_deferred_exception() self.check_status(model_name, {2: 1, 6: 1}, 2, 8, (2,)) except Exception as ex: @@ -276,16 +334,19 @@ def test_static_batch_lt_any_preferred(self): precreated_shm_regions = self.create_advance() for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) self.check_response( trial, - 1, (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), - precreated_shm_regions=precreated_shm_regions) + 1, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + precreated_shm_regions=precreated_shm_regions, + ) self.check_deferred_exception() self.check_status(model_name, {1: 1}, 1, 1, (1,)) except Exception as ex: @@ -298,16 +359,19 @@ def test_static_batch_not_preferred(self): precreated_shm_regions = self.create_advance() for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) self.check_response( trial, - 3, (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), - precreated_shm_regions=precreated_shm_regions) + 3, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + precreated_shm_regions=precreated_shm_regions, + ) self.check_deferred_exception() self.check_status(model_name, {3: 1}, 1, 3, (1,)) except Exception as ex: @@ -320,16 
+384,19 @@ def test_static_batch_gt_max_preferred(self): precreated_shm_regions = self.create_advance() for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) self.check_response( trial, - 7, (3000, None), - precreated_shm_regions=precreated_shm_regions) + 7, + (3000, None), + precreated_shm_regions=precreated_shm_regions, + ) self.check_deferred_exception() self.check_status(model_name, {7: 1}, 1, 7, (1,)) except Exception as ex: @@ -350,25 +417,29 @@ def test_multi_batch_different_shape_allow_ragged(self): threads = [] threads.append( - threading.Thread(target=iu.infer_zero, - args=(self, trial, 1, dtype, ([1, 16],), - ([1, 16],)), - kwargs={ - 'use_grpc': USE_GRPC, - 'use_http': USE_HTTP, - 'use_http_json_tensors': False, - 'use_streaming': False - })) - threads.append( - threading.Thread(target=iu.infer_zero, - args=(self, trial, 1, dtype, ([1, 8],), - ([1, 8],)), - kwargs={ - 'use_grpc': USE_GRPC, - 'use_http': USE_HTTP, - 'use_http_json_tensors': False, - 'use_streaming': False - })) + threading.Thread( + target=iu.infer_zero, + args=(self, trial, 1, dtype, ([1, 16],), ([1, 16],)), + kwargs={ + "use_grpc": USE_GRPC, + "use_http": USE_HTTP, + "use_http_json_tensors": False, + "use_streaming": False, + }, + ) + ) + threads.append( + threading.Thread( + target=iu.infer_zero, + args=(self, trial, 1, dtype, ([1, 8],), ([1, 8],)), + kwargs={ + "use_grpc": USE_GRPC, + "use_http": USE_HTTP, + "use_http_json_tensors": False, + "use_streaming": False, + }, + ) + ) threads[0].start() threads[1].start() for t in threads: @@ -386,17 +457,18 @@ def test_multi_batch_different_shape(self): # immediately and the second delayed by the max batch queue # delay if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) @@ -407,20 +479,27 @@ def test_multi_batch_different_shape(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'input_size': 16, - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "input_size": 16, + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 1, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 1, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'input_size': 8, - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': 
precreated_shm1_regions - })) + "input_size": 8, + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -438,17 +517,18 @@ def test_multi_batch_not_preferred(self): # delay (minus the difference in time that they arrived in the # queue) if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) @@ -457,21 +537,31 @@ def test_multi_batch_not_preferred(self): threads.append( threading.Thread( target=self.check_response, - args=(trial, 1, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 1, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 3, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms - 2000)), + args=( + trial, + 3, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms - 2000), + ), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -489,20 +579,21 @@ def test_multi_batch_not_preferred_different_shape(self): # two requests to be immediately responded to and the third # response to be delayed by the max batch queue delay. 
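The check_response calls above encode latency expectations as an (upper bound, optional lower bound) window in milliseconds. A minimal sketch of that pattern in isolation, assuming the QA model naming (graphdef_float32_float32_float32 with INPUT0/INPUT1 of shape [1, 16]) and a server on localhost:8001; the thresholds are illustrative.

import threading
import time

import numpy as np
import tritonclient.grpc as grpcclient

MODEL = "graphdef_float32_float32_float32"  # QA naming convention; illustrative


def timed_infer(results, idx, lt_ms, gt_ms=None):
    # One request per thread; record whether its wall-clock time fell in the window.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    inputs = []
    for name in ("INPUT0", "INPUT1"):
        tensor = grpcclient.InferInput(name, [1, 16], "FP32")
        tensor.set_data_from_numpy(np.zeros((1, 16), np.float32))
        inputs.append(tensor)
    start = time.time()
    client.infer(model_name=MODEL, inputs=inputs)
    elapsed_ms = (time.time() - start) * 1000
    results[idx] = elapsed_ms < lt_ms and (gt_ms is None or elapsed_ms > gt_ms)


results = {}
threads = [
    # Two size-1 requests together reach the preferred batch size 2, so both
    # should return well before the 10 s max queue delay.
    threading.Thread(target=timed_infer, args=(results, 0, 3000)),
    threading.Thread(target=timed_infer, args=(results, 1, 3000)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)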
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) @@ -513,27 +604,36 @@ def test_multi_batch_not_preferred_different_shape(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 3, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 1, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 1, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'input_size': 8, - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "input_size": 8, + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads[0].start() threads[1].start() time.sleep(1) @@ -554,23 +654,24 @@ def test_multi_batch_preferred_different_shape(self): # preferred size so that third and forth response are sent # immediately. 
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] - shm3_region_names = ['ip30', 'ip31', 'op30', 'op31'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] + shm3_region_names = ["ip30", "ip31", "op30", "op31"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None shm3_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) - precreated_shm3_regions = self.create_advance(['op30', 'op31']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) + precreated_shm3_regions = self.create_advance(["op30", "op31"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) @@ -581,35 +682,43 @@ def test_multi_batch_preferred_different_shape(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 3, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'input_size': 8, - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "input_size": 8, + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 5, (6000, None)), kwargs={ - 'input_size': 8, - 'shm_region_names': shm3_region_names, - 'precreated_shm_regions': precreated_shm3_regions - })) + "input_size": 8, + "shm_region_names": shm3_region_names, + "precreated_shm_regions": precreated_shm3_regions, + }, + ) + ) threads[0].start() threads[1].start() time.sleep(1) @@ -629,17 +738,18 @@ def test_multi_batch_gt_max_preferred(self): # be processed by the dynamic batcher. This should cause both # responses to be returned immediately. 
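The shm*_region_names and create_advance bookkeeping used throughout these tests pre-creates output regions and registers them with the server before inference. A minimal sketch of that flow with the system shared-memory utility from tritonclient, assuming 512-byte regions as in create_advance; the region names here are illustrative.

import tritonclient.grpc as grpcclient
import tritonclient.utils.shared_memory as shm

client = grpcclient.InferenceServerClient(url="localhost:8001")

handles = []
for region in ["op00", "op01"]:
    # Create the 512-byte region locally, then register it with Triton under the same name.
    handles.append(shm.create_shared_memory_region(region + "_data", "/" + region, 512))
    client.register_system_shared_memory(region + "_data", "/" + region, 512)

# ... run inferences whose outputs are written into the registered regions ...

for region, handle in zip(["op00", "op01"], handles):
    client.unregister_system_shared_memory(region + "_data")
    shm.destroy_shared_memory_region(handle)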
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) @@ -650,17 +760,21 @@ def test_multi_batch_gt_max_preferred(self): target=self.check_response, args=(trial, 3, (3000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 7, (3000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -681,17 +795,18 @@ def test_multi_batch_sum_gt_max_preferred(self): # since it alone is not greater than max preferred size, will # be delayed. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) @@ -702,18 +817,25 @@ def test_multi_batch_sum_gt_max_preferred(self): target=self.check_response, args=(trial, 3, (3000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 4, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 4, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -729,17 +851,18 @@ def test_multi_same_output0(self): 
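The next tests send requests that ask for only one of the model's two outputs. A minimal sketch of such a request, assuming the QA model naming and a server on localhost:8001; an output that is not listed is simply absent from the response, so as_numpy returns None for it.

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

inputs = []
for name in ("INPUT0", "INPUT1"):
    tensor = grpcclient.InferInput(name, [1, 16], "FP32")
    tensor.set_data_from_numpy(np.arange(16, dtype=np.float32).reshape(1, 16))
    inputs.append(tensor)

# Only OUTPUT0 is listed, so only OUTPUT0 comes back in the response.
result = client.infer(
    model_name="graphdef_float32_float32_float32",
    inputs=inputs,
    outputs=[grpcclient.InferRequestedOutput("OUTPUT0")],
)
print(result.as_numpy("OUTPUT0"))
print(result.as_numpy("OUTPUT1"))  # None: not requested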
# batched and get the correct response even though they don't # request both outputs. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00'] - shm1_region_names = ['ip10', 'ip11', 'op10'] + shm0_region_names = ["ip00", "ip01", "op00"] + shm1_region_names = ["ip10", "ip11", "op10"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00']) - precreated_shm1_regions = self.create_advance(['op10']) + precreated_shm0_regions = self.create_advance(["op00"]) + precreated_shm1_regions = self.create_advance(["op10"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) @@ -751,19 +874,23 @@ def test_multi_same_output0(self): target=self.check_response, args=(trial, 1, (3000, None)), kwargs={ - 'requested_outputs': ("OUTPUT0",), - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "requested_outputs": ("OUTPUT0",), + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (3000, None)), kwargs={ - 'requested_outputs': ("OUTPUT0",), - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "requested_outputs": ("OUTPUT0",), + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() threads[1].start() for t in threads: @@ -778,17 +905,18 @@ def test_multi_same_output1(self): # batched and get the correct response even though they don't # request both outputs. 
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op11'] + shm0_region_names = ["ip00", "ip01", "op01"] + shm1_region_names = ["ip10", "ip11", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op01']) - precreated_shm1_regions = self.create_advance(['op11']) + precreated_shm0_regions = self.create_advance(["op01"]) + precreated_shm1_regions = self.create_advance(["op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) @@ -800,19 +928,23 @@ def test_multi_same_output1(self): target=self.check_response, args=(trial, 1, (3000, None)), kwargs={ - 'requested_outputs': ("OUTPUT1",), - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "requested_outputs": ("OUTPUT1",), + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (3000, None)), kwargs={ - 'requested_outputs': ("OUTPUT1",), - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "requested_outputs": ("OUTPUT1",), + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() threads[1].start() for t in threads: @@ -828,17 +960,18 @@ def test_multi_different_outputs(self): # batched and get the correct response even though they don't # request both outputs. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00'] - shm1_region_names = ['ip10', 'ip11', 'op11'] + shm0_region_names = ["ip00", "ip01", "op00"] + shm1_region_names = ["ip10", "ip11", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00']) - precreated_shm1_regions = self.create_advance(['op11']) + precreated_shm0_regions = self.create_advance(["op00"]) + precreated_shm1_regions = self.create_advance(["op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) @@ -850,19 +983,23 @@ def test_multi_different_outputs(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'requested_outputs': ("OUTPUT0",), - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "requested_outputs": ("OUTPUT0",), + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'requested_outputs': ("OUTPUT1",), - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "requested_outputs": ("OUTPUT1",), + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() threads[1].start() for t in threads: @@ -877,15 +1014,16 @@ def test_multi_different_output_order(self): # different order. 
They should be batched and get the correct # response even though they use different order. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op11', 'op10'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op11", "op10"] else: shm0_region_names = None shm1_region_names = None for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) @@ -893,21 +1031,25 @@ def test_multi_different_output_order(self): threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(trial, 1, (6000, None)), - kwargs={ - 'requested_outputs': - ("OUTPUT0", "OUTPUT1"), - 'shm_region_names': shm0_region_names - })) - threads.append( - threading.Thread(target=self.check_response, - args=(trial, 1, (6000, None)), - kwargs={ - 'requested_outputs': - ("OUTPUT1", "OUTPUT0"), - 'shm_region_names': shm1_region_names - })) + threading.Thread( + target=self.check_response, + args=(trial, 1, (6000, None)), + kwargs={ + "requested_outputs": ("OUTPUT0", "OUTPUT1"), + "shm_region_names": shm0_region_names, + }, + ) + ) + threads.append( + threading.Thread( + target=self.check_response, + args=(trial, 1, (6000, None)), + kwargs={ + "requested_outputs": ("OUTPUT1", "OUTPUT0"), + "shm_region_names": shm1_region_names, + }, + ) + ) threads[0].start() threads[1].start() for t in threads: @@ -927,24 +1069,24 @@ def test_multi_batch_delayed_sum_gt_max_preferred(self): # immediately but the second response, since it alone is not # greater than max preferred size, will be delayed. 
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) # Need scheduler to wait for queue to contain 2 requests self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) threads = [] threads.append( @@ -952,18 +1094,25 @@ def test_multi_batch_delayed_sum_gt_max_preferred(self): target=self.check_response, args=(trial, 3, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 4, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 4, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -977,7 +1126,7 @@ def test_multi_batch_delayed_sum_gt_max_preferred(self): def test_multi_batch_delayed_use_max_batch(self): # Send three requests with first not having preferred size, # second being smaller than max preferred size but the sum of - # the requests being larger than max preferred size and thrid + # the requests being larger than max preferred size and third # is sent after the first two requests exceeds the queue delay # and the sum of the requests to be in full batch. Use # TRITONSERVER_DELAY_SCHEDULER in the environment so that @@ -986,55 +1135,67 @@ def test_multi_batch_delayed_use_max_batch(self): # while it appears that the first two responses to be returned # after being delayed and the third response to be returned immediately. 
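The delayed-batching tests above only run meaningfully when test.sh starts the server with TRITONSERVER_DELAY_SCHEDULER exported, which makes the scheduler hold requests until the given count is queued. A minimal sketch of the guard they rely on; the exact expected count varies per test (2 to 7 in this file).

import os
import unittest


class DelaySchedulerGuard(unittest.TestCase):
    def test_delay_scheduler_is_configured(self):
        # test.sh exports TRITONSERVER_DELAY_SCHEDULER before launching tritonserver;
        # the scheduler then waits for this many queued requests before batching.
        self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
        self.assertGreaterEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2)


if __name__ == "__main__":
    unittest.main()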
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) # Need scheduler to wait for queue to contain 3 requests self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 3) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 3) threads = [] threads.append( threading.Thread( target=self.check_response, - args=(trial, 3, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 3, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 4, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 4, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads[0].start() threads[1].start() time.sleep(11) @@ -1057,30 +1218,30 @@ def test_multi_batch_delayed_preferred_different_shape(self): # shape as the third that causes a preferred size so that # third and forth response are sent immediately. 
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] - shm3_region_names = ['ip30', 'ip31', 'op30', 'op31'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] + shm3_region_names = ["ip30", "ip31", "op30", "op31"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None shm3_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) - precreated_shm3_regions = self.create_advance(['op30', 'op31']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) + precreated_shm3_regions = self.create_advance(["op30", "op31"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) # Need scheduler to wait for queue to contain 4 requests self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 4) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 4) threads = [] threads.append( @@ -1088,35 +1249,43 @@ def test_multi_batch_delayed_preferred_different_shape(self): target=self.check_response, args=(trial, 1, (3000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 3, (3000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (3000, None)), kwargs={ - 'input_size': 8, - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "input_size": 8, + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 5, (3000, None)), kwargs={ - 'input_size': 8, - 'shm_region_names': shm3_region_names, - 'precreated_shm_regions': precreated_shm3_regions - })) + "input_size": 8, + "shm_region_names": shm3_region_names, + "precreated_shm_regions": precreated_shm3_regions, + }, + ) + ) threads[0].start() threads[1].start() time.sleep(1) @@ -1136,12 +1305,12 @@ def test_multi_batch_use_biggest_preferred(self): # that requests can be queued up before scheduler starts # servicing. 
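Each test first calls check_setup to confirm the deployed model carries the expected dynamic-batching settings. A minimal standalone sketch of that verification, assuming the QA model naming and the [2, 6] preferred sizes with the 10-second max queue delay used above.

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# get_model_config returns the protobuf response; .config is the model configuration.
config = client.get_model_config("graphdef_float32_float32_float32").config
dynamic_batching = config.dynamic_batching

assert sorted(dynamic_batching.preferred_batch_size) == [2, 6]
assert dynamic_batching.max_queue_delay_microseconds == 10000 * 1000  # 10 s, as in these tests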
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] - shm3_region_names = ['ip30', 'ip31', 'op30', 'op31'] - shm4_region_names = ['ip40', 'ip41', 'op40', 'op41'] - shm5_region_names = ['ip50', 'ip51', 'op50', 'op51'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] + shm3_region_names = ["ip30", "ip31", "op30", "op31"] + shm4_region_names = ["ip40", "ip41", "op40", "op41"] + shm5_region_names = ["ip50", "ip51", "op50", "op51"] else: shm0_region_names = None shm1_region_names = None @@ -1149,23 +1318,23 @@ def test_multi_batch_use_biggest_preferred(self): shm3_region_names = None shm4_region_names = None shm5_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) - precreated_shm3_regions = self.create_advance(['op30', 'op31']) - precreated_shm4_regions = self.create_advance(['op40', 'op41']) - precreated_shm5_regions = self.create_advance(['op50', 'op51']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) + precreated_shm3_regions = self.create_advance(["op30", "op31"]) + precreated_shm4_regions = self.create_advance(["op40", "op41"]) + precreated_shm5_regions = self.create_advance(["op50", "op51"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) # Need scheduler to wait for queue to contain 6 request self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 6) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 6) threads = [] threads.append( @@ -1173,49 +1342,61 @@ def test_multi_batch_use_biggest_preferred(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm3_region_names, - 'precreated_shm_regions': precreated_shm3_regions - })) + "shm_region_names": shm3_region_names, + "precreated_shm_regions": precreated_shm3_regions, + }, + ) + ) threads.append( 
threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm4_region_names, - 'precreated_shm_regions': precreated_shm4_regions - })) + "shm_region_names": shm4_region_names, + "precreated_shm_regions": precreated_shm4_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm5_region_names, - 'precreated_shm_regions': precreated_shm5_regions - })) + "shm_region_names": shm5_region_names, + "precreated_shm_regions": precreated_shm5_regions, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1234,27 +1415,27 @@ def test_multi_batch_use_best_preferred(self): # that requests can be queued up before scheduler starts # servicing. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) # Need scheduler to wait for queue to contain 3 requests self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 3) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 3) threads = [] threads.append( @@ -1262,26 +1443,35 @@ def test_multi_batch_use_best_preferred(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 1, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 1, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads[0].start() threads[1].start() time.sleep(1) @@ -1296,41 +1486,36 @@ def test_multi_batch_use_best_preferred(self): def test_multi_batch_preserve_ordering(self): model_base = "custom" dtype = np.float32 - shapes = ([ - 1, - 1, - ],) + shapes = ( + [ + 1, + 1, + ], + ) 
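check_status now tolerates the short window between a response returning and the server's statistics being updated by polling for an acceptable execution count. A standalone sketch of that retry loop, assuming the QA model naming; the set of acceptable counts is illustrative.

import time

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
model_name = "graphdef_float32_float32_float32"  # QA naming convention; illustrative
expected_exec_counts = (2, 3)  # any of these counts is acceptable; illustrative

# Statistics are updated shortly after responses return, so poll briefly instead of
# failing on the first read (the same idea as the retry loop in check_status).
for attempt in range(10):
    stats = client.get_inference_statistics(model_name, "1")
    actual_exec_count = stats.model_stats[0].execution_count
    if actual_exec_count in expected_exec_counts:
        break
    print(
        "WARNING: expected one of {}, got {} (attempt {})".format(
            expected_exec_counts, actual_exec_count, attempt
        )
    )
    time.sleep(1)
else:
    raise AssertionError("execution_count never reached an expected value")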
try: # use threads to send 12 requests without waiting for response threads = [] for i in range(12): if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm_region_name_prefix = [ - "input" + str(i), "output" + str(i) - ] + shm_region_name_prefix = ["input" + str(i), "output" + str(i)] else: shm_region_name_prefix = None threads.append( - threading.Thread(target=iu.infer_zero, - args=(self, model_base, 1, dtype, shapes, - shapes), - kwargs={ - 'use_grpc': - USE_GRPC, - 'use_http': - USE_HTTP, - 'use_http_json_tensors': - False, - 'use_streaming': - False, - 'shm_region_name_prefix': - shm_region_name_prefix, - 'use_system_shared_memory': - TEST_SYSTEM_SHARED_MEMORY, - 'use_cuda_shared_memory': - TEST_CUDA_SHARED_MEMORY - })) + threading.Thread( + target=iu.infer_zero, + args=(self, model_base, 1, dtype, shapes, shapes), + kwargs={ + "use_grpc": USE_GRPC, + "use_http": USE_HTTP, + "use_http_json_tensors": False, + "use_streaming": False, + "shm_region_name_prefix": shm_region_name_prefix, + "use_system_shared_memory": TEST_SYSTEM_SHARED_MEMORY, + "use_cuda_shared_memory": TEST_CUDA_SHARED_MEMORY, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1348,30 +1533,30 @@ def test_preferred_batch_only_aligned(self): # servicing. The batcher should form a batch of preferred # size 4. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] - shm3_region_names = ['ip30', 'ip31', 'op30', 'op31'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] + shm3_region_names = ["ip30", "ip31", "op30", "op31"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None shm3_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) - precreated_shm3_regions = self.create_advance(['op30', 'op31']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) + precreated_shm3_regions = self.create_advance(["op30", "op31"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [4, 6], 0) # Need scheduler to wait for queue to contain 4 requests self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 4) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 4) threads = [] threads.append( @@ -1379,33 +1564,41 @@ def test_preferred_batch_only_aligned(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + 
"precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm3_region_names, - 'precreated_shm_regions': precreated_shm3_regions - })) + "shm_region_names": shm3_region_names, + "precreated_shm_regions": precreated_shm3_regions, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1422,33 +1615,33 @@ def test_preferred_batch_only_unaligned(self): # servicing. The batcher should form a batch of preferred # size 4 followed by a batch of size 1. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] - shm3_region_names = ['ip30', 'ip31', 'op30', 'op31'] - shm4_region_names = ['ip40', 'ip41', 'op40', 'op41'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] + shm3_region_names = ["ip30", "ip31", "op30", "op31"] + shm4_region_names = ["ip40", "ip41", "op40", "op41"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None shm3_region_names = None shm4_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) - precreated_shm3_regions = self.create_advance(['op30', 'op31']) - precreated_shm4_regions = self.create_advance(['op40', 'op41']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) + precreated_shm3_regions = self.create_advance(["op30", "op31"]) + precreated_shm4_regions = self.create_advance(["op40", "op41"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [4, 6], 0) # Need scheduler to wait for queue to contain 3 requests self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 5) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 5) threads = [] threads.append( @@ -1456,41 +1649,51 @@ def test_preferred_batch_only_unaligned(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, 
None)), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm3_region_names, - 'precreated_shm_regions': precreated_shm3_regions - })) + "shm_region_names": shm3_region_names, + "precreated_shm_regions": precreated_shm3_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm4_region_names, - 'precreated_shm_regions': precreated_shm4_regions - })) + "shm_region_names": shm4_region_names, + "precreated_shm_regions": precreated_shm4_regions, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1507,13 +1710,13 @@ def test_preferred_batch_only_use_biggest_preferred(self): # servicing. The batcher should form a batch of largest preferred # size 6 followed by a batch of size 1. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] - shm3_region_names = ['ip30', 'ip31', 'op30', 'op31'] - shm4_region_names = ['ip40', 'ip41', 'op40', 'op41'] - shm5_region_names = ['ip50', 'ip51', 'op50', 'op51'] - shm6_region_names = ['ip60', 'ip61', 'op60', 'op61'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] + shm3_region_names = ["ip30", "ip31", "op30", "op31"] + shm4_region_names = ["ip40", "ip41", "op40", "op41"] + shm5_region_names = ["ip50", "ip51", "op50", "op51"] + shm6_region_names = ["ip60", "ip61", "op60", "op61"] else: shm0_region_names = None shm1_region_names = None @@ -1522,24 +1725,24 @@ def test_preferred_batch_only_use_biggest_preferred(self): shm4_region_names = None shm5_region_names = None shm6_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) - precreated_shm3_regions = self.create_advance(['op30', 'op31']) - precreated_shm4_regions = self.create_advance(['op40', 'op41']) - precreated_shm5_regions = self.create_advance(['op50', 'op51']) - precreated_shm6_regions = self.create_advance(['op60', 'op61']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) + precreated_shm3_regions = self.create_advance(["op30", "op31"]) + precreated_shm4_regions = self.create_advance(["op40", "op41"]) + precreated_shm5_regions = self.create_advance(["op50", "op51"]) + precreated_shm6_regions = self.create_advance(["op60", "op61"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [4, 6], 0) # Need scheduler to wait for queue to contain 6 request self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 7) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 7) threads = [] 
threads.append( @@ -1547,57 +1750,71 @@ def test_preferred_batch_only_use_biggest_preferred(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm3_region_names, - 'precreated_shm_regions': precreated_shm3_regions - })) + "shm_region_names": shm3_region_names, + "precreated_shm_regions": precreated_shm3_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm4_region_names, - 'precreated_shm_regions': precreated_shm4_regions - })) + "shm_region_names": shm4_region_names, + "precreated_shm_regions": precreated_shm4_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm5_region_names, - 'precreated_shm_regions': precreated_shm5_regions - })) + "shm_region_names": shm5_region_names, + "precreated_shm_regions": precreated_shm5_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm6_region_names, - 'precreated_shm_regions': precreated_shm6_regions - })) + "shm_region_names": shm6_region_names, + "precreated_shm_regions": precreated_shm6_regions, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1613,27 +1830,27 @@ def test_preferred_batch_only_use_no_preferred_size(self): # requests can be queued up before scheduler starts # servicing. The batcher should form a batch of of 3. 
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [4, 6], 0) # Need scheduler to wait for queue to contain 3 request self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 3) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 3) threads = [] threads.append( @@ -1641,25 +1858,31 @@ def test_preferred_batch_only_use_no_preferred_size(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1675,48 +1898,43 @@ def test_max_queue_delay_only_non_default(self): # there can be either 1 or 2 model executions. 
model_base = "custom" dtype = np.float32 - shapes = ([ - 1, - 1, - ],) + shapes = ( + [ + 1, + 1, + ], + ) try: # use threads to send 12 requests without waiting for response threads = [] for i in range(12): if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm_region_name_prefix = [ - "input" + str(i), "output" + str(i) - ] + shm_region_name_prefix = ["input" + str(i), "output" + str(i)] else: shm_region_name_prefix = None threads.append( - threading.Thread(target=iu.infer_zero, - args=(self, model_base, 1, dtype, shapes, - shapes), - kwargs={ - 'use_grpc': - USE_GRPC, - 'use_http': - USE_HTTP, - 'use_http_json_tensors': - False, - 'use_streaming': - False, - 'shm_region_name_prefix': - shm_region_name_prefix, - 'use_system_shared_memory': - TEST_SYSTEM_SHARED_MEMORY, - 'use_cuda_shared_memory': - TEST_CUDA_SHARED_MEMORY - })) + threading.Thread( + target=iu.infer_zero, + args=(self, model_base, 1, dtype, shapes, shapes), + kwargs={ + "use_grpc": USE_GRPC, + "use_http": USE_HTTP, + "use_http_json_tensors": False, + "use_streaming": False, + "shm_region_name_prefix": shm_region_name_prefix, + "use_system_shared_memory": TEST_SYSTEM_SHARED_MEMORY, + "use_cuda_shared_memory": TEST_CUDA_SHARED_MEMORY, + }, + ) + ) for t in threads: t.start() for t in threads: t.join() self.check_deferred_exception() model_name = tu.get_zero_model_name(model_base, len(shapes), dtype) - self.check_status(model_name, None, 12, 12, (1,2)) + self.check_status(model_name, None, 12, 12, (1, 2)) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -1727,41 +1945,36 @@ def test_max_queue_delay_only_default(self): # and the remaining requests will form the second batch. model_base = "custom" dtype = np.float32 - shapes = ([ - 1, - 1, - ],) + shapes = ( + [ + 1, + 1, + ], + ) try: # use threads to send 12 requests without waiting for response threads = [] for i in range(12): if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm_region_name_prefix = [ - "input" + str(i), "output" + str(i) - ] + shm_region_name_prefix = ["input" + str(i), "output" + str(i)] else: shm_region_name_prefix = None threads.append( - threading.Thread(target=iu.infer_zero, - args=(self, model_base, 1, dtype, shapes, - shapes), - kwargs={ - 'use_grpc': - USE_GRPC, - 'use_http': - USE_HTTP, - 'use_http_json_tensors': - False, - 'use_streaming': - False, - 'shm_region_name_prefix': - shm_region_name_prefix, - 'use_system_shared_memory': - TEST_SYSTEM_SHARED_MEMORY, - 'use_cuda_shared_memory': - TEST_CUDA_SHARED_MEMORY - })) + threading.Thread( + target=iu.infer_zero, + args=(self, model_base, 1, dtype, shapes, shapes), + kwargs={ + "use_grpc": USE_GRPC, + "use_http": USE_HTTP, + "use_http_json_tensors": False, + "use_streaming": False, + "shm_region_name_prefix": shm_region_name_prefix, + "use_system_shared_memory": TEST_SYSTEM_SHARED_MEMORY, + "use_cuda_shared_memory": TEST_CUDA_SHARED_MEMORY, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1772,5 +1985,6 @@ def test_max_queue_delay_only_default(self): except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) -if __name__ == '__main__': + +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_batcher/test.sh b/qa/L0_batcher/test.sh index d8ab6131f7..c5f8819276 100755 --- a/qa/L0_batcher/test.sh +++ b/qa/L0_batcher/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -74,7 +74,7 @@ if [ "$TEST_VALGRIND" -eq 1 ]; then test_multi_batch_different_shape_allow_ragged" fi -TF_VERSION=${TF_VERSION:=1} +TF_VERSION=${TF_VERSION:=2} # On windows the paths invoked by the script (running in WSL) must use # /mnt/c when needed but the paths on the tritonserver command-line @@ -91,6 +91,14 @@ else TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends + + # PyTorch on SBSA requires libgomp to be loaded first. See the following + # GitHub issue for more information: + # https://github.com/pytorch/pytorch/issues/2575 + arch=`uname -m` + if [ $arch = "aarch64" ]; then + SERVER_LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libgomp.so.1 + fi fi SERVER_ARGS_EXTRA="--backend-directory=${BACKEND_DIR} --backend-config=tensorflow,version=${TF_VERSION}" @@ -99,7 +107,7 @@ source ../common/util.sh RET=0 # If BACKENDS not specified, set to all -BACKENDS=${BACKENDS:="graphdef savedmodel onnx libtorch plan"} +BACKENDS=${BACKENDS:="graphdef savedmodel onnx libtorch plan python"} export BACKENDS # Basic batcher tests @@ -138,11 +146,21 @@ MAX_QUEUE_DELAY_ONLY_TESTS=${MAX_QUEUE_DELAY_ONLY_TESTS:="test_max_queue_delay_o test_max_queue_delay_only_non_default"} # Setup non-variable-size model repository -rm -fr *.log *.serverlog models && mkdir models +rm -fr *.log models && mkdir models for BACKEND in $BACKENDS; do TMP_MODEL_DIR="$DATADIR/qa_model_repository/${BACKEND}_float32_float32_float32" - - cp -r $TMP_MODEL_DIR models/. && + if [ "$BACKEND" == "python" ]; then + # We will be using ONNX models config.pbtxt and tweak them to make them + # appropriate for Python backend + onnx_model="${DATADIR}/qa_model_repository/onnx_float32_float32_float32" + python_model=`echo $onnx_model | sed 's/onnx/python/g' | sed 's,'"$DATADIR/qa_model_repository/"',,g'` + mkdir -p models/$python_model/1/ + cat $onnx_model/config.pbtxt | sed 's/platform:.*/backend:\ "python"/g' | sed 's/onnx/python/g' > models/$python_model/config.pbtxt + cp $onnx_model/output0_labels.txt models/$python_model + cp ../python_models/add_sub/model.py models/$python_model/1/ + else + cp -r $TMP_MODEL_DIR models/. + fi (cd models/$(basename $TMP_MODEL_DIR) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 8/" config.pbtxt && \ sed -i "s/^version_policy:.*/version_policy: { specific { versions: [1] }}/" config.pbtxt && \ @@ -152,8 +170,18 @@ done rm -fr preferred_batch_only_models && mkdir preferred_batch_only_models for BACKEND in $BACKENDS; do TMP_MODEL_DIR="$DATADIR/qa_model_repository/${BACKEND}_float32_float32_float32" - - cp -r $TMP_MODEL_DIR preferred_batch_only_models/. && + if [ "$BACKEND" == "python" ]; then + # We will be using ONNX models config.pbtxt and tweak them to make them + # appropriate for Python backend + onnx_model="${DATADIR}/qa_model_repository/onnx_float32_float32_float32" + python_model=`echo $onnx_model | sed 's/onnx/python/g' | sed 's,'"$DATADIR/qa_model_repository/"',,g'` + mkdir -p preferred_batch_only_models/$python_model/1/ + cat $onnx_model/config.pbtxt | sed 's/platform:.*/backend:\ "python"/g' | sed 's/onnx/python/g' > preferred_batch_only_models/$python_model/config.pbtxt + cp $onnx_model/output0_labels.txt preferred_batch_only_models/$python_model + cp ../python_models/add_sub/model.py preferred_batch_only_models/$python_model/1/ + else + cp -r $TMP_MODEL_DIR preferred_batch_only_models/. 
+ fi (cd preferred_batch_only_models/$(basename $TMP_MODEL_DIR) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 8/" config.pbtxt && \ sed -i "s/^version_policy:.*/version_policy: { specific { versions: [1] }}/" config.pbtxt && \ @@ -164,14 +192,22 @@ done rm -fr var_models && mkdir var_models for BACKEND in $BACKENDS; do TMP_MODEL_DIR="$DATADIR/qa_variable_model_repository/${BACKEND}_float32_float32_float32" - - for TMP_DIR in $TMP_MODEL_DIR; do - cp -r $TMP_DIR var_models/. && - (cd var_models/$(basename $TMP_DIR) && \ + if [ "$BACKEND" == "python" ]; then + # We will be using ONNX models config.pbtxt and tweak them to make them + # appropriate for Python backend + onnx_model="${DATADIR}/qa_variable_model_repository/onnx_float32_float32_float32" + python_model=`echo $onnx_model | sed 's/onnx/python/g' | sed 's,'"$DATADIR/qa_variable_model_repository/"',,g'` + mkdir -p var_models/$python_model/1/ + cat $onnx_model/config.pbtxt | sed 's/platform:.*/backend:\ "python"/g' | sed 's/onnx/python/g' > var_models/$python_model/config.pbtxt + cp $onnx_model/output0_labels.txt var_models/$python_model + cp ../python_models/add_sub/model.py var_models/$python_model/1/ + else + cp -r $TMP_MODEL_DIR var_models/. + fi + (cd var_models/$(basename $TMP_MODEL_DIR) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 8/" config.pbtxt && \ sed -i "s/^version_policy:.*/version_policy: { specific { versions: [1] }}/" config.pbtxt && \ echo "dynamic_batching { preferred_batch_size: [ 2, 6 ], max_queue_delay_microseconds: 10000000 }" >> config.pbtxt) - done done for MC in `ls var_models/*/config.pbtxt`; do @@ -214,6 +250,19 @@ if [[ $BACKENDS == *"onnx"* ]]; then dynamic_batching { preferred_batch_size: [ 2, 6 ], max_queue_delay_microseconds: 10000000 }" >> config.pbtxt) fi +if [[ $BACKENDS == *"libtorch"* ]]; then + # Use nobatch model to match the ragged test requirement + cp -r $DATADIR/qa_identity_model_repository/libtorch_nobatch_zero_1_float32 var_models/libtorch_zero_1_float32 && \ + (cd var_models/libtorch_zero_1_float32 && \ + sed -i "s/nobatch_//" config.pbtxt && \ + sed -i "s/^max_batch_size:.*/max_batch_size: 8/" config.pbtxt && \ + sed -i "s/name: \"INPUT__0\"/name: \"INPUT__0\"\\nallow_ragged_batch: true/" config.pbtxt && \ + echo "batch_output [{target_name: \"OUTPUT__0\" \ + kind: BATCH_SCATTER_WITH_INPUT_SHAPE \ + source_input: \"INPUT__0\" }] \ + dynamic_batching { preferred_batch_size: [ 2, 6 ], max_queue_delay_microseconds: 10000000 }" >> config.pbtxt) +fi + # Need to launch the server for each test so that the model status is # reset (which is used to make sure the correctly batch size was used # for execution). 
Test everything with fixed-tensor-size models and @@ -224,7 +273,7 @@ for model_type in FIXED VARIABLE; do MODEL_PATH=models && [[ "$model_type" == "VARIABLE" ]] && MODEL_PATH=var_models for i in $NO_DELAY_TESTS ; do SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$model_type.serverlog" + SERVER_LOG="./$i.$model_type.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.$model_type.valgrind.log" @@ -277,7 +326,7 @@ for model_type in FIXED VARIABLE; do [[ "$i" != "test_multi_batch_use_best_preferred" ]] && [[ "$i" != "test_multi_batch_delayed_use_max_batch" ]] && export TRITONSERVER_DELAY_SCHEDULER=2 SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$model_type.serverlog" + SERVER_LOG="./$i.$model_type.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.$model_type.valgrind.log" @@ -327,7 +376,7 @@ done export BATCHER_TYPE=VARIABLE for i in $DIFFERENT_SHAPE_TESTS ; do SERVER_ARGS="--model-repository=$MODELDIR/var_models ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.VARIABLE.serverlog" + SERVER_LOG="./$i.VARIABLE.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.VARIABLE.valgrind.log" @@ -380,7 +429,7 @@ for i in \ test_multi_batch_delayed_preferred_different_shape ; do export TRITONSERVER_DELAY_SCHEDULER=4 SERVER_ARGS="--model-repository=$MODELDIR/var_models ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.VARIABLE.serverlog" + SERVER_LOG="./$i.VARIABLE.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.VARIABLE.valgrind.log" @@ -433,7 +482,7 @@ for i in $PREFERRED_BATCH_ONLY_TESTS ; do [[ "$i" != "test_preferred_batch_only_unaligned" ]] && export TRITONSERVER_DELAY_SCHEDULER=7 && [[ "$i" != "test_preferred_batch_only_use_biggest_preferred" ]] && export TRITONSERVER_DELAY_SCHEDULER=3 SERVER_ARGS="--model-repository=$MODELDIR/preferred_batch_only_models ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.PREFERRED_BATCH_ONLY.serverlog" + SERVER_LOG="./$i.PREFERRED_BATCH_ONLY.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.PREFERRED_BATCH_ONLY.valgrind.log" @@ -502,7 +551,7 @@ for i in $MAX_QUEUE_DELAY_ONLY_TESTS ; do sed -i "s/max_queue_delay_microseconds:.*\[.*\]/max_queue_delay_microseconds: ${MAX_QUEUE_DELAY_MICROSECONDS}/g" config.pbtxt ) SERVER_ARGS="--model-repository=$MODELDIR/custom_models ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.MAX_QUEUE_DELAY_ONLY.serverlog" + SERVER_LOG="./$i.MAX_QUEUE_DELAY_ONLY.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.MAX_QUEUE_DELAY_ONLY.valgrind.log" @@ -580,7 +629,7 @@ if [[ "$(< /proc/sys/kernel/osrelease)" != *microsoft* ]]; then # not preserve SERVER_ARGS="--trace-file=not_preserve.log --trace-level=MIN --trace-rate=1 --model-repository=$MODELDIR/custom_models ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./not_preserve.serverlog" + SERVER_LOG="./not_preserve.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./not_preserve.valgrind.log" @@ -635,7 +684,7 @@ if [[ "$(< /proc/sys/kernel/osrelease)" != *microsoft* ]]; then sed -i "s/dynamic_batching.*/dynamic_batching { preferred_batch_size: [ 4 ] preserve_ordering: true }/g" config.pbtxt) SERVER_ARGS="--trace-file=preserve.log --trace-level=MIN --trace-rate=1 --model-repository=$MODELDIR/custom_models ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./preserve.serverlog" + SERVER_LOG="./preserve.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./preserve.valgrind.log" @@ -695,3 +744,4 @@ else fi exit $RET + diff --git 
a/qa/L0_batcher/verify_timestamps.py b/qa/L0_batcher/verify_timestamps.py old mode 100644 new mode 100755 index 30aad60fa3..3271135fcd --- a/qa/L0_batcher/verify_timestamps.py +++ b/qa/L0_batcher/verify_timestamps.py @@ -33,7 +33,7 @@ def verify_timestamps(traces, preserve): # Order traces by id - traces = sorted(traces, key=lambda t: t.get('id', -1)) + traces = sorted(traces, key=lambda t: t.get("id", -1)) # Filter the trace that is not meaningful and group them by 'id' filtered_traces = dict() @@ -41,7 +41,7 @@ def verify_timestamps(traces, preserve): for trace in traces: if "id" not in trace: continue - # Skip GRPC traces as actual traces are not genarated via GRPC, + # Skip GRPC traces as actual traces are not generated via GRPC, # thus GRPC traces are ill-formed if "timestamps" in trace: is_grpc = False @@ -53,16 +53,16 @@ def verify_timestamps(traces, preserve): grpc_id_offset += 1 continue - if (trace['id'] in filtered_traces.keys()): - rep_trace = filtered_traces[trace['id']] - # Apend the timestamp to the trace representing this 'id' + if trace["id"] in filtered_traces.keys(): + rep_trace = filtered_traces[trace["id"]] + # Append the timestamp to the trace representing this 'id' if "timestamps" in trace: rep_trace["timestamps"] += trace["timestamps"] else: # Use this trace to represent this 'id' if "timestamps" not in trace: trace["timestamps"] = [] - filtered_traces[trace['id']] = trace + filtered_traces[trace["id"]] = trace # First find the latest response complete timestamp for the batch with large delay large_delay_response_complete = 0 @@ -75,10 +75,11 @@ def verify_timestamps(traces, preserve): compute_span = timestamps["COMPUTE_END"] - timestamps["COMPUTE_START"] # If the 3rd batch is also processed by large delay instance, we don't # want to use its responses as baseline - if trace["id"] <= ( - 8 + grpc_id_offset) and compute_span >= 400 * 1000 * 1000: + if trace["id"] <= (8 + grpc_id_offset) and compute_span >= 400 * 1000 * 1000: response_complete = timestamps["INFER_RESPONSE_COMPLETE"] - large_delay_response_complete = max(large_delay_response_complete, response_complete) + large_delay_response_complete = max( + large_delay_response_complete, response_complete + ) else: small_delay_traces.append(trace) @@ -92,8 +93,11 @@ def verify_timestamps(traces, preserve): response_request_after_large_delay_count += 1 # Hardcoded expected count here - print("responses after large delay count: {}".format( - response_request_after_large_delay_count)) + print( + "responses after large delay count: {}".format( + response_request_after_large_delay_count + ) + ) if preserve: # If preserve ordering, there must be large delay batch followed by # small delay batch and thus at least 4 responses are sent after @@ -103,15 +107,18 @@ def verify_timestamps(traces, preserve): # before large delay batch regardless of the ordering in scheduler return 0 if response_request_after_large_delay_count == 0 else 1 -if __name__ == '__main__': + +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-p', - '--preserve', - action="store_true", - required=False, - default=False, - help='Timestamps is collected with preserve ordering') - parser.add_argument('file', type=argparse.FileType('r'), nargs='+') + parser.add_argument( + "-p", + "--preserve", + action="store_true", + required=False, + default=False, + help="Timestamps is collected with preserve ordering", + ) + parser.add_argument("file", type=argparse.FileType("r"), nargs="+") FLAGS = parser.parse_args() for f in 
FLAGS.file: diff --git a/qa/L0_buffer_attributes/buffer_attributes_test.py b/qa/L0_buffer_attributes/buffer_attributes_test.py old mode 100644 new mode 100755 index 907a469bab..7d61e082c5 --- a/qa/L0_buffer_attributes/buffer_attributes_test.py +++ b/qa/L0_buffer_attributes/buffer_attributes_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -29,28 +31,26 @@ sys.path.append("../common") import unittest + import numpy as np import test_util as tu - +import tritonclient.grpc as grpcclient +import tritonclient.http as httpclient import tritonclient.utils.cuda_shared_memory as cudashm from tritonclient.utils import triton_to_np_dtype -import tritonclient.http as httpclient -import tritonclient.grpc as grpcclient class BufferAttributesTest(tu.TestResultCollector): - def test_buffer_attributes(self): - model_name = 'bls' + model_name = "bls" # Infer clients = [ - httpclient.InferenceServerClient(url='localhost:8000'), - grpcclient.InferenceServerClient(url='localhost:8001') + httpclient.InferenceServerClient(url="localhost:8000"), + grpcclient.InferenceServerClient(url="localhost:8001"), ] triton_clients = [httpclient, grpcclient] for i, client in enumerate(clients): - # To make sure no shared memory regions are registered with the # server. client.unregister_system_shared_memory() @@ -59,8 +59,7 @@ def test_buffer_attributes(self): triton_client = triton_clients[i] inputs = [] outputs = [] - inputs.append(triton_client.InferInput('INPUT0', [1, 1000], - "INT32")) + inputs.append(triton_client.InferInput("INPUT0", [1, 1000], "INT32")) input0_data = np.arange(start=0, stop=1000, dtype=np.int32) input0_data = np.expand_dims(input0_data, axis=0) @@ -69,45 +68,55 @@ def test_buffer_attributes(self): output_byte_size = input_byte_size shm_ip0_handle = cudashm.create_shared_memory_region( - "input0_data", input_byte_size, 0) + "input0_data", input_byte_size, 0 + ) shm_op0_handle = cudashm.create_shared_memory_region( - "output0_data", output_byte_size, 0) + "output0_data", output_byte_size, 0 + ) client.register_cuda_shared_memory( - "input0_data", cudashm.get_raw_handle(shm_ip0_handle), 0, - input_byte_size) + "input0_data", + cudashm.get_raw_handle(shm_ip0_handle), + 0, + input_byte_size, + ) client.register_cuda_shared_memory( - "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, - input_byte_size) + "output0_data", + cudashm.get_raw_handle(shm_op0_handle), + 0, + input_byte_size, + ) cudashm.set_shared_memory_region(shm_ip0_handle, [input0_data]) inputs[0].set_shared_memory("input0_data", input_byte_size) if triton_client is grpcclient: - outputs.append(triton_client.InferRequestedOutput('OUTPUT0')) + outputs.append(triton_client.InferRequestedOutput("OUTPUT0")) outputs[0].set_shared_memory("output0_data", output_byte_size) else: outputs.append( - triton_client.InferRequestedOutput('OUTPUT0', - binary_data=True)) + triton_client.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs[0].set_shared_memory("output0_data", output_byte_size) - results = client.infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = client.infer( + model_name=model_name, inputs=inputs, outputs=outputs + ) output0 = results.get_output("OUTPUT0") self.assertIsNotNone(output0) if triton_client is 
grpcclient: output0_data = cudashm.get_contents_as_numpy( - shm_op0_handle, triton_to_np_dtype(output0.datatype), - output0.shape) + shm_op0_handle, triton_to_np_dtype(output0.datatype), output0.shape + ) else: output0_data = cudashm.get_contents_as_numpy( - shm_op0_handle, triton_to_np_dtype(output0['datatype']), - output0['shape']) + shm_op0_handle, + triton_to_np_dtype(output0["datatype"]), + output0["shape"], + ) self.assertTrue(np.all(output0_data == input0_data)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_buffer_attributes/models/bls/1/model.py b/qa/L0_buffer_attributes/models/bls/1/model.py index c4b5151a1e..2d3e78e936 100644 --- a/qa/L0_buffer_attributes/models/bls/1/model.py +++ b/qa/L0_buffer_attributes/models/bls/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -29,23 +29,26 @@ # Simple Python model that executes a BLS request on an identity model. class TritonPythonModel: - def execute(self, requests): responses = [] for request in requests: # Get INPUT0 - input0 = pb_utils.get_input_tensor_by_name(request, 'INPUT0') + input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") infer_request = pb_utils.InferenceRequest( - model_name='identity', + model_name="identity", requested_output_names=["OUTPUT0"], - inputs=[input0]) + inputs=[input0], + ) infer_response = infer_request.exec() if infer_response.has_error(): - raise pb_utils.TritonModelException( - infer_response.error().message()) + raise pb_utils.TritonModelException(infer_response.error().message()) - inference_response = pb_utils.InferenceResponse(output_tensors=[pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0')]) + inference_response = pb_utils.InferenceResponse( + output_tensors=[ + pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + ] + ) responses.append(inference_response) return responses diff --git a/qa/L0_buffer_attributes/models/identity/1/model.py b/qa/L0_buffer_attributes/models/identity/1/model.py index 781360b147..2d4b592ae3 100644 --- a/qa/L0_buffer_attributes/models/identity/1/model.py +++ b/qa/L0_buffer_attributes/models/identity/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,7 +28,6 @@ class TritonPythonModel: - def execute(self, requests): """ Identity model using DLPack in Python backend. @@ -36,6 +35,8 @@ def execute(self, requests): responses = [] for request in requests: input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") - out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", input_tensor.to_dlpack()) + out_tensor = pb_utils.Tensor.from_dlpack( + "OUTPUT0", input_tensor.to_dlpack() + ) responses.append(pb_utils.InferenceResponse([out_tensor])) return responses diff --git a/qa/L0_buffer_attributes/test.sh b/qa/L0_buffer_attributes/test.sh old mode 100644 new mode 100755 index 52babf37e2..7e2f35d837 --- a/qa/L0_buffer_attributes/test.sh +++ b/qa/L0_buffer_attributes/test.sh @@ -1,4 +1,5 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+#!/bin/bash +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions diff --git a/qa/L0_client_build_variants/test.sh b/qa/L0_client_build_variants/test.sh index 9c36791144..ab3feb6172 100755 --- a/qa/L0_client_build_variants/test.sh +++ b/qa/L0_client_build_variants/test.sh @@ -31,15 +31,17 @@ apt-get install -y --no-install-recommends \ rapidjson-dev # Client build requires recent version of CMake (FetchContent required) -wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \ - gpg --dearmor - | \ - tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \ -apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' && \ -apt-get update && \ -apt-get install -y --no-install-recommends \ -cmake-data=3.21.1-0kitware1ubuntu20.04.1 cmake=3.21.1-0kitware1ubuntu20.04.1; \ +# Using CMAKE installation instruction from:: https://apt.kitware.com/ +apt update -q=2 \ + && apt install -y gpg wget \ + && wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null \ + && . /etc/os-release \ + && echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null \ + && apt-get update -q=2 \ + && apt-get install -y --no-install-recommends cmake=3.27.7* cmake-data=3.27.7* cmake --version + set +e mkdir -p /workspace/build @@ -62,6 +64,9 @@ mkdir -p /workspace/build -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=OFF \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients java-clients python-clients) if [ $? -eq 0 ]; then @@ -90,6 +95,9 @@ fi -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=ON \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients python-clients) if [ $? -eq 0 ]; then @@ -117,6 +125,9 @@ fi -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=ON \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients python-clients) if [ $? -eq 0 ]; then @@ -143,6 +154,9 @@ fi -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=ON \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients python-clients) if [ $? -eq 0 ]; then @@ -169,6 +183,9 @@ fi -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=ON \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients python-clients) if [ $? 
-eq 0 ]; then @@ -195,6 +212,9 @@ fi -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=ON \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients python-clients) if [ $? -eq 0 ]; then @@ -221,6 +241,9 @@ fi -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=ON \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients python-clients) if [ $? -eq 0 ]; then diff --git a/qa/L0_client_java/test.sh b/qa/L0_client_java/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_client_memory_growth/client_memory_mail.py b/qa/L0_client_memory_growth/client_memory_mail.py old mode 100644 new mode 100755 index 53c20f6f9f..ef1703f2c3 --- a/qa/L0_client_memory_growth/client_memory_mail.py +++ b/qa/L0_client_memory_growth/client_memory_mail.py @@ -26,20 +26,25 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys -sys.path.append("../common") -import nightly_email_helper +sys.path.append("../common") import glob from datetime import date -if __name__ == '__main__': +import nightly_email_helper + +if __name__ == "__main__": today = date.today().strftime("%Y-%m-%d") subject = "Triton Client Memory Growth " + sys.argv[1] + " Summary: " + today memory_graphs = glob.glob("client_memory_growth*.log") write_up = "
This test is run for both HTTP and GRPC protocols using C++ and Python test scripts. The max-allowed difference between mean and maximum memory usage is set to 10MB and 1MB for C++ and Python tests individually."
     write_up += "• What to look for: A linear memory growth in the beginning of the graph is acceptable only when it is followed by a flat memory usage. If a linear memory growth is observed during the entire test then there is possibly a memory leak."
-    html_content = "" + write_up + ""
+    html_content = (
+        ''
+        + write_up
+        + ''
+    )
     for mem_graph in sorted(memory_graphs):
         html_content += "\n" + mem_graph + "\n"
         with open(mem_graph, "r") as f:
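For reference, the pass criterion described in the email text above (the gap between the mean and the maximum memory usage must stay within the per-language budget, 10MB for C++ and 1MB for Python) can be sketched roughly as follows. The function name, sample values, and threshold handling are illustrative assumptions, not the actual logic in check_valgrind_log.py or the nightly tooling.

# Rough sketch only: illustrates the mean-vs-max growth criterion described in
# the email write-up above. Sample values and names are hypothetical.
def memory_growth_ok(samples_mb, max_allowed_alloc_mb):
    """Return True if peak usage stays within the allowed band above the mean."""
    mean_mb = sum(samples_mb) / len(samples_mb)
    return max(samples_mb) - mean_mb <= max_allowed_alloc_mb


if __name__ == "__main__":
    # Growth that levels off: acceptable under a 10MB budget (MAX_ALLOWED_ALLOC="10").
    print(memory_growth_ok([100.0, 104.0, 105.0, 105.0, 105.0], 10.0))  # True
    # Growth that never flattens: flagged as a possible leak.
    print(memory_growth_ok([100.0 + i for i in range(50)], 10.0))  # False
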
diff --git a/qa/L0_client_memory_growth/models/custom_identity_int32/config.pbtxt b/qa/L0_client_memory_growth/models/custom_identity_int32/config.pbtxt
index 8d3a78baf4..6a2a76bde5 100644
--- a/qa/L0_client_memory_growth/models/custom_identity_int32/config.pbtxt
+++ b/qa/L0_client_memory_growth/models/custom_identity_int32/config.pbtxt
@@ -35,7 +35,7 @@ input [
     name: "INPUT0"
     data_type: TYPE_INT32
     dims: [ -1 ]
-    
+
   }
 ]
 output [
diff --git a/qa/L0_client_memory_growth/test.sh b/qa/L0_client_memory_growth/test.sh
index ecb0493b28..73188812b2 100755
--- a/qa/L0_client_memory_growth/test.sh
+++ b/qa/L0_client_memory_growth/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -59,13 +59,23 @@ source ../common/util.sh
 # Set the number of repetitions in nightly and weekly tests
 # Set the email subject for nightly and weekly tests
 if [ "$TRITON_PERF_WEEKLY" == 1 ]; then
-    # Run the test for each case approximately 1.5 hours
-    # All tests are run cumulatively for 7 hours
-    REPETITION_HTTP_CPP=1300000
-    REPETITION_HTTP_PY=2100000
-    REPETITION_GRPC_CPP=10000000
-    REPETITION_GRPC_PY=1500000
-    EMAIL_SUBJECT="Weekly"
+    if [ "$TRITON_PERF_LONG" == 1 ]; then
+        # ~ 12 hours
+        # GRPC cycles are reduced as there is high fluctuation in time spent
+        REPETITION_HTTP_CPP=2220000
+        REPETITION_HTTP_PY=3600000
+        REPETITION_GRPC_CPP=8000000
+        REPETITION_GRPC_PY=1500000
+        EMAIL_SUBJECT="Weekly Long"
+    else
+        # Run the test for each case approximately 1.5 hours
+        # All tests are run cumulatively for 7 hours
+        REPETITION_HTTP_CPP=1300000
+        REPETITION_HTTP_PY=2100000
+        REPETITION_GRPC_CPP=6600000
+        REPETITION_GRPC_PY=1000000
+        EMAIL_SUBJECT="Weekly"
+    fi
 else
     REPETITION_CPP=100000
     REPETITION_PY=10000
@@ -106,6 +116,13 @@ for PROTOCOL in http grpc; do
         if [ "$LANG" == "c++" ]; then
             MEMORY_GROWTH_TEST=$MEMORY_GROWTH_TEST_CPP
             MAX_ALLOWED_ALLOC="10"
+            # NOTE: This test has risk of exhausting all available sockets in
+            # the ephemeral port range. Re-using the same client connection
+            # ("-R") can easily solve this problem. However, to cleanly separate
+            # the resources used by different client objects, we create new
+            # connections for each request and retry/sleep on failure to give
+            # the system time to reclaim sockets after TIME_WAIT.
+            # TIP: You can use the "ss -s" command to observe the socket usage.
             EXTRA_ARGS="-r ${REPETITION_CPP} -i ${PROTOCOL}"
         else
             MEMORY_GROWTH_TEST="python $MEMORY_GROWTH_TEST_PY"
@@ -113,18 +130,21 @@ for PROTOCOL in http grpc; do
             EXTRA_ARGS="-r ${REPETITION_PY} -i ${PROTOCOL}"
         fi
 
+        set +e
         SECONDS=0
         $LEAKCHECK $LEAKCHECK_ARGS $MEMORY_GROWTH_TEST $EXTRA_ARGS >> ${CLIENT_LOG} 2>&1
+        TEST_RETCODE=$?
         TEST_DURATION=$SECONDS
-        if [ $? -ne 0 ]; then
+        set -e
+        if [ ${TEST_RETCODE} -ne 0 ]; then
             cat ${CLIENT_LOG}
             RET=1
             echo -e "\n***\n*** Test FAILED\n***"
         else
             python3 ../common/check_valgrind_log.py -f $LEAKCHECK_LOG
             if [ $? -ne 0 ]; then
-            echo -e "\n***\n*** Memory leak detected\n***"
-            RET=1
+                echo -e "\n***\n*** Memory leak detected\n***"
+                RET=1
             fi
 
             set +e
@@ -159,8 +179,8 @@ else
 fi
 
 # Run only if both TRITON_FROM and TRITON_TO_DL are set
-if [[ ! -z "$TRITON_FROM" ]] || [[ ! -z "$TRITON_TO_DL" ]]; then
-    python client_memory_mail.py $EMAIL_SUBJECT
+if [[ ! -z "$TRITON_FROM" ]] && [[ ! -z "$TRITON_TO_DL" ]]; then
+    python client_memory_mail.py "$EMAIL_SUBJECT"
 fi
 
 exit $RET
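The comment added to L0_client_memory_growth/test.sh above explains why the C++ test creates a new client connection per request instead of reusing one ("-R"), and retries after a sleep so the kernel can reclaim sockets left in TIME_WAIT. A minimal Python sketch of that retry/sleep idea follows; the helper name, backoff values, and the use of the custom_identity_int32 model are assumptions for illustration, not code from this patch.

import time

import numpy as np
import tritonclient.http as httpclient


def infer_with_retry(model_name, inputs, retries=5, backoff_s=1.0):
    # Deliberately create a new client (and socket) per request; on failure,
    # sleep and retry to give the system time to reclaim TIME_WAIT sockets
    # (socket usage can be observed with "ss -s").
    for attempt in range(retries):
        try:
            client = httpclient.InferenceServerClient(url="localhost:8000")
            return client.infer(model_name, inputs)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff_s)


# Hypothetical usage against the custom_identity_int32 model used elsewhere in
# this patch; assumes a local Triton server listening on port 8000.
data = np.array([[10]], dtype=np.int32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "INT32")
infer_input.set_data_from_numpy(data)
result = infer_with_retry("custom_identity_int32", [infer_input])
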
diff --git a/qa/L0_client_nobatch/client_test.py b/qa/L0_client_nobatch/client_test.py
old mode 100644
new mode 100755
index b2f9467df1..c821d446d2
--- a/qa/L0_client_nobatch/client_test.py
+++ b/qa/L0_client_nobatch/client_test.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,20 +27,19 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-from builtins import range
-from future.utils import iteritems
 import unittest
+
 import numpy as np
-import tritonhttpclient
+import test_util as tu
 import tritongrpcclient
+import tritonhttpclient
 from tritonclientutils import InferenceServerException
-import test_util as tu
 
 
 class ClientNoBatchTest(tu.TestResultCollector):
-
     def test_nobatch_request_for_batching_model(self):
         input_size = 16
 
@@ -47,53 +48,46 @@ def test_nobatch_request_for_batching_model(self):
         # input shapes.
         tensor_shape = (input_size,)
         for protocol in ["http", "grpc"]:
-            model_name = tu.get_model_name("graphdef", np.int32, np.int8,
-                                           np.int8)
-            in0 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
-            in1 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
+            model_name = tu.get_model_name("graphdef", np.int32, np.int8, np.int8)
+            in0 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
+            in1 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
 
             inputs = []
             outputs = []
             if protocol == "http":
                 triton_client = tritonhttpclient.InferenceServerClient(
-                    url='localhost:8000', verbose=True)
+                    url="localhost:8000", verbose=True
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritonhttpclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT1'))
+                    tritonhttpclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT1"))
             else:
                 triton_client = tritongrpcclient.InferenceServerClient(
-                    url='localhost:8001', verbose=True)
+                    url="localhost:8001", verbose=True
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritongrpcclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT1'))
+                    tritongrpcclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT1"))
 
             # Initialize the data
             inputs[0].set_data_from_numpy(in0)
             inputs[1].set_data_from_numpy(in1)
 
             try:
-                results = triton_client.infer(model_name,
-                                              inputs,
-                                              outputs=outputs)
+                _ = triton_client.infer(model_name, inputs, outputs=outputs)
                 self.assertTrue(
-                    False,
-                    "expected failure with no batch request for batching model")
+                    False, "expected failure with no batch request for batching model"
+                )
             except InferenceServerException as ex:
                 pass
 
@@ -105,53 +99,48 @@ def test_batch_request_for_nobatching_model(self):
         # is included in the shape
         tensor_shape = (1, input_size)
         for protocol in ["http", "grpc"]:
-            model_name = tu.get_model_name("graphdef_nobatch", np.int32,
-                                           np.int8, np.int8)
-            in0 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
-            in1 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
+            model_name = tu.get_model_name(
+                "graphdef_nobatch", np.int32, np.int8, np.int8
+            )
+            in0 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
+            in1 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
 
             inputs = []
             outputs = []
             if protocol == "http":
                 triton_client = tritonhttpclient.InferenceServerClient(
-                    url='localhost:8000', verbose=True)
+                    url="localhost:8000", verbose=True
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritonhttpclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT1'))
+                    tritonhttpclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT1"))
             else:
                 triton_client = tritongrpcclient.InferenceServerClient(
-                    url='localhost:8001', verbose=True)
+                    url="localhost:8001", verbose=True
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritongrpcclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT1'))
+                    tritongrpcclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT1"))
 
             # Initialize the data
             inputs[0].set_data_from_numpy(in0)
             inputs[1].set_data_from_numpy(in1)
 
             try:
-                results = triton_client.infer(model_name,
-                                              inputs,
-                                              outputs=outputs)
+                _ = triton_client.infer(model_name, inputs, outputs=outputs)
                 self.assertTrue(
                     False,
-                    "expected failure with batched request for non-batching model"
+                    "expected failure with batched request for non-batching model",
                 )
             except InferenceServerException as ex:
                 pass
@@ -164,41 +153,38 @@ def test_nobatch_request_for_nonbatching_model(self):
         # input shapes.
         tensor_shape = (input_size,)
         for protocol in ["http", "grpc"]:
-            model_name = tu.get_model_name("graphdef_nobatch", np.int32,
-                                           np.int8, np.int8)
-            in0 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
-            in1 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
+            model_name = tu.get_model_name(
+                "graphdef_nobatch", np.int32, np.int8, np.int8
+            )
+            in0 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
+            in1 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
 
             inputs = []
             outputs = []
             if protocol == "http":
                 triton_client = tritonhttpclient.InferenceServerClient(
-                    url='localhost:8000', verbose=True)
+                    url="localhost:8000", verbose=True
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritonhttpclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT1'))
+                    tritonhttpclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT1"))
             else:
                 triton_client = tritongrpcclient.InferenceServerClient(
-                    url='localhost:8001', verbose=True)
+                    url="localhost:8001", verbose=True
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritongrpcclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT1'))
+                    tritongrpcclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT1"))
 
             # Initialize the data
             inputs[0].set_data_from_numpy(in0)
@@ -214,41 +200,36 @@ def test_batch_request_for_batching_model(self):
         # is included in the shape
         tensor_shape = (1, input_size)
         for protocol in ["http", "grpc"]:
-            model_name = tu.get_model_name("graphdef", np.int32, np.int8,
-                                           np.int8)
-            in0 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
-            in1 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
+            model_name = tu.get_model_name("graphdef", np.int32, np.int8, np.int8)
+            in0 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
+            in1 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
 
             inputs = []
             outputs = []
             if protocol == "http":
                 triton_client = tritonhttpclient.InferenceServerClient(
-                    url='localhost:8000', verbose=True)
+                    url="localhost:8000", verbose=True
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritonhttpclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT1'))
+                    tritonhttpclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT1"))
             else:
                 triton_client = tritongrpcclient.InferenceServerClient(
-                    url='localhost:8001', verbose=True)
+                    url="localhost:8001", verbose=True
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritongrpcclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT1'))
+                    tritongrpcclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT1"))
 
             # Initialize the data
             inputs[0].set_data_from_numpy(in0)
@@ -257,5 +238,5 @@ def test_batch_request_for_batching_model(self):
             results = triton_client.infer(model_name, inputs, outputs=outputs)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_client_timeout/client_timeout_test.py b/qa/L0_client_timeout/client_infer_timeout_test.py
old mode 100644
new mode 100755
similarity index 61%
rename from qa/L0_client_timeout/client_timeout_test.py
rename to qa/L0_client_timeout/client_infer_timeout_test.py
index 4f4a59bcea..700e9bfe9b
--- a/qa/L0_client_timeout/client_timeout_test.py
+++ b/qa/L0_client_timeout/client_infer_timeout_test.py
@@ -1,5 +1,6 @@
-#!/bin/bash
-# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -26,24 +27,22 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-from functools import partial
-import numpy as np
 import queue
-import unittest
-import os
-import time
 import socket
-import test_util as tu
+import unittest
+from functools import partial
 
-import tritongrpcclient as grpcclient
-import tritonhttpclient as httpclient
-from tritonclientutils import InferenceServerException
+import numpy as np
+import test_util as tu
+import tritonclient.grpc as grpcclient
+import tritonclient.http as httpclient
+from tritonclient.utils import InferenceServerException
 
 
 class UserData:
-
     def __init__(self):
         self._completed_requests = queue.Queue()
 
@@ -55,55 +54,60 @@ def callback(user_data, result, error):
         user_data._completed_requests.put(result)
 
 
-class ClientTimeoutTest(tu.TestResultCollector):
-
+class ClientInferTimeoutTest(tu.TestResultCollector):
     def setUp(self):
         self.model_name_ = "custom_identity_int32"
         self.input0_data_ = np.array([[10]], dtype=np.int32)
+        self.input0_data_byte_size_ = 32
+        self.INFER_SMALL_INTERVAL = 2.0  # seconds; shorter than the model's 3-second delay, so requests time out
 
     def _prepare_request(self, protocol):
-        if (protocol == "grpc"):
+        if protocol == "grpc":
             self.inputs_ = []
-            self.inputs_.append(grpcclient.InferInput('INPUT0', [1, 1],
-                                                      "INT32"))
+            self.inputs_.append(grpcclient.InferInput("INPUT0", [1, 1], "INT32"))
             self.outputs_ = []
-            self.outputs_.append(grpcclient.InferRequestedOutput('OUTPUT0'))
+            self.outputs_.append(grpcclient.InferRequestedOutput("OUTPUT0"))
         else:
             self.inputs_ = []
-            self.inputs_.append(httpclient.InferInput('INPUT0', [1, 1],
-                                                      "INT32"))
+            self.inputs_.append(httpclient.InferInput("INPUT0", [1, 1], "INT32"))
             self.outputs_ = []
-            self.outputs_.append(httpclient.InferRequestedOutput('OUTPUT0'))
+            self.outputs_.append(httpclient.InferRequestedOutput("OUTPUT0"))
 
         self.inputs_[0].set_data_from_numpy(self.input0_data_)
 
     def test_grpc_infer(self):
-        triton_client = grpcclient.InferenceServerClient(url="localhost:8001",
-                                                         verbose=True)
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
         self._prepare_request("grpc")
 
         # The model is configured to take three seconds to send the
         # response. Expect an exception for small timeout values.
         with self.assertRaises(InferenceServerException) as cm:
-            result = triton_client.infer(model_name=self.model_name_,
-                                         inputs=self.inputs_,
-                                         outputs=self.outputs_,
-                                         client_timeout=0.2)
+            _ = triton_client.infer(
+                model_name=self.model_name_,
+                inputs=self.inputs_,
+                outputs=self.outputs_,
+                client_timeout=0.2,
+            )
         self.assertIn("Deadline Exceeded", str(cm.exception))
 
         # Expect inference to pass successfully for a large timeout
         # value
-        result = triton_client.infer(model_name=self.model_name_,
-                                     inputs=self.inputs_,
-                                     outputs=self.outputs_,
-                                     client_timeout=10)
-
-        output0_data = result.as_numpy('OUTPUT0')
+        result = triton_client.infer(
+            model_name=self.model_name_,
+            inputs=self.inputs_,
+            outputs=self.outputs_,
+            client_timeout=10,
+        )
+
+        output0_data = result.as_numpy("OUTPUT0")
         self.assertTrue(np.array_equal(self.input0_data_, output0_data))
 
     def test_grpc_async_infer(self):
-        triton_client = grpcclient.InferenceServerClient(url="localhost:8001",
-                                                         verbose=True)
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
         self._prepare_request("grpc")
 
         user_data = UserData()
@@ -111,11 +115,13 @@ def test_grpc_async_infer(self):
         # The model is configured to take three seconds to send the
         # response. Expect an exception for small timeout values.
         with self.assertRaises(InferenceServerException) as cm:
-            triton_client.async_infer(model_name=self.model_name_,
-                                      inputs=self.inputs_,
-                                      callback=partial(callback, user_data),
-                                      outputs=self.outputs_,
-                                      client_timeout=2)
+            triton_client.async_infer(
+                model_name=self.model_name_,
+                inputs=self.inputs_,
+                callback=partial(callback, user_data),
+                outputs=self.outputs_,
+                client_timeout=self.INFER_SMALL_INTERVAL,
+            )
             data_item = user_data._completed_requests.get()
             if type(data_item) == InferenceServerException:
                 raise data_item
@@ -123,23 +129,25 @@ def test_grpc_async_infer(self):
 
         # Expect inference to pass successfully for a large timeout
         # value
-        triton_client.async_infer(model_name=self.model_name_,
-                                  inputs=self.inputs_,
-                                  callback=partial(callback, user_data),
-                                  outputs=self.outputs_,
-                                  client_timeout=10)
+        triton_client.async_infer(
+            model_name=self.model_name_,
+            inputs=self.inputs_,
+            callback=partial(callback, user_data),
+            outputs=self.outputs_,
+            client_timeout=10,
+        )
 
         # Wait until the results are available in user_data
         data_item = user_data._completed_requests.get()
         self.assertFalse(type(data_item) == InferenceServerException)
 
-        output0_data = data_item.as_numpy('OUTPUT0')
+        output0_data = data_item.as_numpy("OUTPUT0")
         self.assertTrue(np.array_equal(self.input0_data_, output0_data))
 
     def test_grpc_stream_infer(self):
-
-        triton_client = grpcclient.InferenceServerClient(url="localhost:8001",
-                                                         verbose=True)
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
 
         self._prepare_request("grpc")
         user_data = UserData()
@@ -148,11 +156,12 @@ def test_grpc_stream_infer(self):
         # response. Expect an exception for small timeout values.
         with self.assertRaises(InferenceServerException) as cm:
             triton_client.stop_stream()
-            triton_client.start_stream(callback=partial(callback, user_data),
-                                       stream_timeout=1)
-            triton_client.async_stream_infer(model_name=self.model_name_,
-                                             inputs=self.inputs_,
-                                             outputs=self.outputs_)
+            triton_client.start_stream(
+                callback=partial(callback, user_data), stream_timeout=1
+            )
+            triton_client.async_stream_infer(
+                model_name=self.model_name_, inputs=self.inputs_, outputs=self.outputs_
+            )
             data_item = user_data._completed_requests.get()
             if type(data_item) == InferenceServerException:
                 raise data_item
@@ -161,73 +170,79 @@ def test_grpc_stream_infer(self):
         # Expect inference to pass successfully for a large timeout
         # value
         triton_client.stop_stream()
-        triton_client.start_stream(callback=partial(callback, user_data),
-                                   stream_timeout=100)
+        triton_client.start_stream(
+            callback=partial(callback, user_data), stream_timeout=100
+        )
 
-        triton_client.async_stream_infer(model_name=self.model_name_,
-                                         inputs=self.inputs_,
-                                         outputs=self.outputs_)
+        triton_client.async_stream_infer(
+            model_name=self.model_name_, inputs=self.inputs_, outputs=self.outputs_
+        )
         data_item = user_data._completed_requests.get()
         triton_client.stop_stream()
 
         if type(data_item) == InferenceServerException:
             raise data_item
-        output0_data = data_item.as_numpy('OUTPUT0')
+        output0_data = data_item.as_numpy("OUTPUT0")
         self.assertTrue(np.array_equal(self.input0_data_, output0_data))
 
     def test_http_infer(self):
-
         self._prepare_request("http")
 
         # The model is configured to take three seconds to send the
         # response. Expect an exception for small timeout values.
         with self.assertRaises(socket.timeout) as cm:
             triton_client = httpclient.InferenceServerClient(
-                url="localhost:8000", verbose=True, network_timeout=2.0)
-            result = triton_client.infer(model_name=self.model_name_,
-                                         inputs=self.inputs_,
-                                         outputs=self.outputs_)
+                url="localhost:8000",
+                verbose=True,
+                network_timeout=self.INFER_SMALL_INTERVAL,
+            )
+            _ = triton_client.infer(
+                model_name=self.model_name_, inputs=self.inputs_, outputs=self.outputs_
+            )
         self.assertIn("timed out", str(cm.exception))
 
         # Expect to successfully pass with sufficiently large timeout
         triton_client = httpclient.InferenceServerClient(
-            url="localhost:8000", verbose=True, connection_timeout=10.0)
+            url="localhost:8000", verbose=True, connection_timeout=10.0
+        )
 
-        result = triton_client.infer(model_name=self.model_name_,
-                                     inputs=self.inputs_,
-                                     outputs=self.outputs_)
+        result = triton_client.infer(
+            model_name=self.model_name_, inputs=self.inputs_, outputs=self.outputs_
+        )
 
-        output0_data = result.as_numpy('OUTPUT0')
+        output0_data = result.as_numpy("OUTPUT0")
         self.assertTrue(np.array_equal(self.input0_data_, output0_data))
 
     def test_http_async_infer(self):
-
         self._prepare_request("http")
 
         # The model is configured to take three seconds to send the
         # response. Expect an exception for small timeout values.
         with self.assertRaises(socket.timeout) as cm:
             triton_client = httpclient.InferenceServerClient(
-                url="localhost:8000", verbose=True, network_timeout=2.0)
+                url="localhost:8000",
+                verbose=True,
+                network_timeout=self.INFER_SMALL_INTERVAL,
+            )
             async_request = triton_client.async_infer(
-                model_name=self.model_name_,
-                inputs=self.inputs_,
-                outputs=self.outputs_)
+                model_name=self.model_name_, inputs=self.inputs_, outputs=self.outputs_
+            )
             result = async_request.get_result()
         self.assertIn("timed out", str(cm.exception))
 
         # Expect to successfully pass with sufficiently large timeout
         triton_client = httpclient.InferenceServerClient(
-            url="localhost:8000", verbose=True, connection_timeout=10.0)
+            url="localhost:8000", verbose=True, connection_timeout=10.0
+        )
 
-        async_request = triton_client.async_infer(model_name=self.model_name_,
-                                                  inputs=self.inputs_,
-                                                  outputs=self.outputs_)
+        async_request = triton_client.async_infer(
+            model_name=self.model_name_, inputs=self.inputs_, outputs=self.outputs_
+        )
         result = async_request.get_result()
 
-        output0_data = result.as_numpy('OUTPUT0')
+        output0_data = result.as_numpy("OUTPUT0")
         self.assertTrue(np.array_equal(self.input0_data_, output0_data))
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_client_timeout/client_non_infer_timeout_test.py b/qa/L0_client_timeout/client_non_infer_timeout_test.py
new file mode 100755
index 0000000000..bbaf8c34e8
--- /dev/null
+++ b/qa/L0_client_timeout/client_non_infer_timeout_test.py
@@ -0,0 +1,340 @@
+#!/usr/bin/env python3
+
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import unittest
+
+import numpy as np
+import test_util as tu
+import tritonclient.grpc as grpcclient
+from tritonclient.utils import InferenceServerException
+
+
+class ClientNonInferTimeoutTest(tu.TestResultCollector):
+    def setUp(self):
+        self.model_name_ = "custom_identity_int32"
+        self.input0_data_ = np.array([[10]], dtype=np.int32)
+        self.input0_data_byte_size_ = 32
+        self.SMALL_INTERVAL = 0.1  # seconds; short enough to always time out
+        self.NORMAL_INTERVAL = 5.0  # seconds; long enough for the server to load the model and respond
+
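+    # Each test below calls one non-infer gRPC API twice: first with
+    # SMALL_INTERVAL, which is expected to exceed the server-side response
+    # delay and raise "Deadline Exceeded", then with NORMAL_INTERVAL, which
+    # is expected to succeed.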
+    def test_grpc_server_live(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.is_server_live(client_timeout=self.SMALL_INTERVAL)
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        self.assertTrue(
+            triton_client.is_server_live(client_timeout=self.NORMAL_INTERVAL)
+        )
+
+    def test_grpc_is_server_ready(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.is_server_ready(client_timeout=self.SMALL_INTERVAL)
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        self.assertTrue(
+            triton_client.is_server_ready(client_timeout=self.NORMAL_INTERVAL)
+        )
+
+    def test_grpc_is_model_ready(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.is_model_ready(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        self.assertTrue(
+            triton_client.is_model_ready(
+                model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+            )
+        )
+
+    def test_grpc_get_server_metadata(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_server_metadata(client_timeout=self.SMALL_INTERVAL)
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+
+        triton_client.get_server_metadata(client_timeout=self.NORMAL_INTERVAL)
+
+    def test_grpc_get_model_metadata(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_model_metadata(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_model_metadata(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_get_model_config(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_model_config(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_model_config(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_model_repository_index(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_model_repository_index(
+                client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_model_repository_index(client_timeout=self.NORMAL_INTERVAL)
+
+    def test_grpc_load_model(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        triton_client.unload_model(model_name=self.model_name_)
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.load_model(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.unload_model(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+        triton_client.load_model(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_unload_model(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.unload_model(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.load_model(model_name=self.model_name_)
+        triton_client.unload_model(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+        triton_client.load_model(model_name=self.model_name_)
+
+    def test_grpc_get_inference_statistics(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_inference_statistics(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_inference_statistics(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_update_trace_settings(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.update_trace_settings(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.update_trace_settings(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_get_trace_settings(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_trace_settings(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_trace_settings(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_update_log_settings(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        settings = {}
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.update_log_settings(
+                settings=settings, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.update_log_settings(
+            settings=settings, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_get_log_settings(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_log_settings(
+                as_json=True, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_log_settings(
+            as_json=True, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_get_system_shared_memory_status(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_system_shared_memory_status(
+                client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_system_shared_memory_status(
+            client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_register_system_shared_memory(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        triton_client.unregister_system_shared_memory()
+        import tritonclient.utils.shared_memory as shm
+
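+        # Create and populate a system shared memory region locally so the
+        # register calls below have a real region to register.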
+        shm_ip0_handle = shm.create_shared_memory_region(
+            "input0_data", "/input_simple", self.input0_data_byte_size_
+        )
+        shm.set_shared_memory_region(shm_ip0_handle, [self.input0_data_])
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.register_system_shared_memory(
+                "input0_data",
+                "/input_simple",
+                self.input0_data_byte_size_,
+                client_timeout=self.SMALL_INTERVAL,
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.unregister_system_shared_memory()
+        triton_client.register_system_shared_memory(
+            "input0_data",
+            "/input_simple",
+            self.input0_data_byte_size_,
+            client_timeout=self.NORMAL_INTERVAL,
+        )
+        triton_client.unregister_system_shared_memory()
+
+    def test_grpc_unregister_system_shared_memory(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.unregister_system_shared_memory(
+                client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.unregister_system_shared_memory(
+            client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_get_cuda_shared_memory_status(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_cuda_shared_memory_status(
+                client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_cuda_shared_memory_status(client_timeout=self.NORMAL_INTERVAL)
+
+    def test_grpc_register_cuda_shared_memory(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        import tritonclient.utils.cuda_shared_memory as cshm
+
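+        # Create and populate a CUDA shared memory region locally; the first
+        # register attempt below uses SMALL_INTERVAL and is expected to time
+        # out.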
+        input_data = np.array([[10]], dtype=np.int32)
+        byteSize = input_data.itemsize * input_data.size
+        shm_op0_handle = cshm.create_shared_memory_region(
+            "dummy_data", byte_size=byteSize, device_id=0
+        )
+        cshm.set_shared_memory_region(shm_op0_handle, [input_data])
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.register_cuda_shared_memory(
+                "dummy_data",
+                cshm.get_raw_handle(shm_op0_handle),
+                device_id=0,
+                byte_size=byteSize,
+                client_timeout=self.SMALL_INTERVAL,
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.unregister_cuda_shared_memory()
+        triton_client.register_cuda_shared_memory(
+            "dummy_data",
+            cshm.get_raw_handle(shm_op0_handle),
+            device_id=0,
+            byte_size=byteSize,
+            client_timeout=self.NORMAL_INTERVAL,
+        )
+        cshm.destroy_shared_memory_region(shm_op0_handle)
+
+    def test_grpc_unregister_cuda_shared_memory(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.unregister_cuda_shared_memory(
+                client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.unregister_cuda_shared_memory(client_timeout=self.NORMAL_INTERVAL)
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_client_timeout/models/custom_identity_int32/config.pbtxt b/qa/L0_client_timeout/models/custom_identity_int32/config.pbtxt
index a42c5dcd45..1732ff32fd 100644
--- a/qa/L0_client_timeout/models/custom_identity_int32/config.pbtxt
+++ b/qa/L0_client_timeout/models/custom_identity_int32/config.pbtxt
@@ -35,7 +35,7 @@ input [
     name: "INPUT0"
     data_type: TYPE_INT32
     dims: [ -1 ]
-    
+
   }
 ]
 output [
diff --git a/qa/L0_client_timeout/test.sh b/qa/L0_client_timeout/test.sh
old mode 100644
new mode 100755
index a832694b84..f250dc9fa3
--- a/qa/L0_client_timeout/test.sh
+++ b/qa/L0_client_timeout/test.sh
@@ -39,10 +39,12 @@ if [ ! -z "$TEST_REPO_ARCH" ]; then
 fi
 
 export CUDA_VISIBLE_DEVICES=0
-
+TIMEOUT_VALUE=100000000
+SHORT_TIMEOUT_VALUE=1000
 RET=0
 
-CLIENT_TIMEOUT_TEST=client_timeout_test.py
+CLIENT_INFER_TIMEOUT_TEST=client_infer_timeout_test.py
+CLIENT_NON_INFER_TIMEOUT_TEST=client_non_infer_timeout_test.py
 CLIENT_TIMEOUT_TEST_CPP=../clients/client_timeout_test
 TEST_RESULT_FILE='test_results.txt'
 
@@ -50,27 +52,62 @@ rm -f *.log
 rm -f *.log.*
 
 CLIENT_LOG=`pwd`/client.log
+CLIENT_GRPC_TIMEOUTS_LOG=`pwd`/client.log.grpc
 DATADIR=`pwd`/models
 SERVER=/opt/tritonserver/bin/tritonserver
-SERVER_ARGS="--model-repository=$DATADIR"
+SERVER_ARGS="--model-repository=$DATADIR --model-control-mode=explicit --load-model=custom_identity_int32 --log-verbose 2"
 source ../common/util.sh
 
 mkdir -p $DATADIR/custom_identity_int32/1
 
+# Test all APIs apart from Infer.
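+# The variable exported below appears to be a server-side test hook that
+# delays every gRPC response by the given number of seconds, so even
+# non-infer APIs can exceed a short client timeout.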
+export TRITONSERVER_SERVER_DELAY_GRPC_RESPONSE_SEC=2
 run_server
+if [ $? -eq 1 ]; then
+    echo -e "\n***\n*** Test Failed: GRPC non-infer APIs\n***"
+    RET=1
+fi
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
     cat $SERVER_LOG
     exit 1
 fi
 
+set +e
+# Expect timeout for everything
+$CLIENT_TIMEOUT_TEST_CPP -t $SHORT_TIMEOUT_VALUE -v -i grpc -p >> ${CLIENT_LOG}.c++.grpc_non_infer_apis 2>&1
+if [ `grep -c "Deadline Exceeded" ${CLIENT_LOG}.c++.grpc_non_infer_apis` != "18" ]; then
+    cat ${CLIENT_LOG}.c++.grpc_non_infer_apis
+    echo -e "\n***\n*** Test Failed. Expected 18 failed\n***"
+    RET=1
+fi
+# Test all APIs with long timeout
+$CLIENT_TIMEOUT_TEST_CPP -t $TIMEOUT_VALUE -v -i grpc -p >> ${CLIENT_LOG} 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Test Failed: GRPC non-infer APIs\n***"
+    RET=1
+fi
+
+set -e
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test infer APIs
+unset TRITONSERVER_SERVER_DELAY_GRPC_RESPONSE_SEC
+SERVER_ARGS="--model-repository=$DATADIR --log-verbose 2"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
 set +e
 
 # CASE 1: Provide too small a timeout and expect a failure.
 # Note, the custom_identity_int32 is configured with a delay
 # of 3 sec.
 # Test request timeout in grpc synchronous inference
-$CLIENT_TIMEOUT_TEST_CPP -t 1000 -v -i grpc >> ${CLIENT_LOG}.c++.grpc_infer 2>&1
+$CLIENT_TIMEOUT_TEST_CPP -t $SHORT_TIMEOUT_VALUE -v -i grpc >> ${CLIENT_LOG}.c++.grpc_infer 2>&1
 if [ $? -eq 0 ]; then
     RET=1
 fi
@@ -81,7 +118,7 @@ if [ `grep -c "Deadline Exceeded" ${CLIENT_LOG}.c++.grpc_infer` != "1" ]; then
 fi
 
 # Test request timeout in grpc asynchronous inference
-$CLIENT_TIMEOUT_TEST_CPP -t 1000 -v -i grpc -a >> ${CLIENT_LOG}.c++.grpc_async_infer 2>&1
+$CLIENT_TIMEOUT_TEST_CPP -t $SHORT_TIMEOUT_VALUE -v -i grpc -a >> ${CLIENT_LOG}.c++.grpc_async_infer 2>&1
 if [ $? -eq 0 ]; then
     RET=1
 fi
@@ -92,7 +129,7 @@ if [ `grep -c "Deadline Exceeded" ${CLIENT_LOG}.c++.grpc_async_infer` != "1" ];
 fi
 
 # Test stream timeout in grpc asynchronous streaming inference
-$CLIENT_TIMEOUT_TEST_CPP -t 1000 -v -i grpc -s >> ${CLIENT_LOG}.c++.grpc_async_stream_infer 2>&1
+$CLIENT_TIMEOUT_TEST_CPP -t $SHORT_TIMEOUT_VALUE -v -i grpc -s >> ${CLIENT_LOG}.c++.grpc_async_stream_infer 2>&1
 if [ $? -eq 0 ]; then
     RET=1
 fi
@@ -103,7 +140,7 @@ if [ `grep -c "Stream has been closed" ${CLIENT_LOG}.c++.grpc_async_stream_infer
 fi
 
 # Test request timeout in http synchronous inference
-$CLIENT_TIMEOUT_TEST_CPP -t 1000 -v >> ${CLIENT_LOG}.c++.http_infer 2>&1
+$CLIENT_TIMEOUT_TEST_CPP -t $SHORT_TIMEOUT_VALUE -v >> ${CLIENT_LOG}.c++.http_infer 2>&1
 if [ $? -eq 0 ]; then
     RET=1
 fi
@@ -115,7 +152,7 @@ fi
 
 
 # Test request timeout in http asynchronous inference
-$CLIENT_TIMEOUT_TEST_CPP -t 1000 -v -a >> ${CLIENT_LOG}.c++.http_async_infer 2>&1
+$CLIENT_TIMEOUT_TEST_CPP -t $SHORT_TIMEOUT_VALUE -v -a >> ${CLIENT_LOG}.c++.http_async_infer 2>&1
 if [ $? -eq 0 ]; then
     RET=1
 fi
@@ -136,7 +173,6 @@ fi
 
 
 # CASE 2: Provide sufficiently large timeout value
-TIMEOUT_VALUE=100000000
 set +e
 
 echo "TEST:  GRPC Synchronous" >> ${CLIENT_LOG}
@@ -174,7 +210,6 @@ if [ $? -ne 0 ]; then
     RET=1
 fi
 
-
 echo "TEST:  Python Library" >> ${CLIENT_LOG}
 
 # CASE 3: Python Library
@@ -185,7 +220,7 @@ for i in test_grpc_infer \
     test_http_infer \
     test_http_async_infer \
    ; do
-    python $CLIENT_TIMEOUT_TEST ClientTimeoutTest.$i >>$CLIENT_LOG 2>&1
+    python $CLIENT_INFER_TIMEOUT_TEST ClientInferTimeoutTest.$i >>$CLIENT_LOG 2>&1
     if [ $? -ne 0 ]; then
         echo -e "\n***\n*** Test $i Failed\n***" >>$CLIENT_LOG
             echo -e "\n***\n*** Test $i Failed\n***"
@@ -204,6 +239,28 @@ set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
+# Test all APIs other than infer
+export TRITONSERVER_SERVER_DELAY_GRPC_RESPONSE_SEC=2
+SERVER_ARGS="${SERVER_ARGS} --model-control-mode=explicit --load-model=custom_identity_int32 --log-verbose 2"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+set +e
+
+python $CLIENT_NON_INFER_TIMEOUT_TEST >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Test $i Failed\n***" >>$CLIENT_LOG
+    echo -e "\n***\n*** Test $i Failed\n***"
+    RET=1
+fi
+
+set -e
+kill $SERVER_PID
+wait $SERVER_PID
+
 if [ $RET -eq 0 ]; then
     echo -e "\n***\n*** Test Passed\n***"
 else
@@ -211,4 +268,5 @@ else
     echo -e "\n***\n*** Test FAILED\n***"
 fi
 
+set +e
 exit $RET
diff --git a/qa/L0_client_valgrind/models/custom_identity_int32/config.pbtxt b/qa/L0_client_valgrind/models/custom_identity_int32/config.pbtxt
index 8d3a78baf4..6a2a76bde5 100644
--- a/qa/L0_client_valgrind/models/custom_identity_int32/config.pbtxt
+++ b/qa/L0_client_valgrind/models/custom_identity_int32/config.pbtxt
@@ -35,7 +35,7 @@ input [
     name: "INPUT0"
     data_type: TYPE_INT32
     dims: [ -1 ]
-    
+
   }
 ]
 output [
diff --git a/qa/L0_client_valgrind/test.sh b/qa/L0_client_valgrind/test.sh
index 062417753c..0870aa883c 100755
--- a/qa/L0_client_valgrind/test.sh
+++ b/qa/L0_client_valgrind/test.sh
@@ -87,8 +87,8 @@ for PROTOCOL in http grpc; do
         else
             python3 ../common/check_valgrind_log.py -f $LEAKCHECK_LOG
             if [ $? -ne 0 ]; then
-            echo -e "\n***\n*** Memory leak detected\n***"
-            RET=1
+                echo -e "\n***\n*** Memory leak detected\n***"
+                RET=1
             fi
         fi
     done
diff --git a/qa/L0_cmdline_trace/test.sh b/qa/L0_cmdline_trace/test.sh
index efe20ac386..66f9a08fc0 100755
--- a/qa/L0_cmdline_trace/test.sh
+++ b/qa/L0_cmdline_trace/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,9 +25,8 @@
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
-SIMPLE_HTTP_CLIENT=../clients/simple_http_infer_client
-SIMPLE_GRPC_CLIENT=../clients/simple_grpc_infer_client
 TRACE_SUMMARY=../common/trace_summary.py
+CLIENT_SCRIPT=trace_client.py
 
 REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
 if [ "$#" -ge 1 ]; then
@@ -79,12 +78,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_off.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_off.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_off.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_off.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -117,12 +116,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_min.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_min.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_min.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_min.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -165,12 +164,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_max.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_max.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_max.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_max.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -212,12 +211,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_1.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_1.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_1.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_1.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -260,12 +259,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_6.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_6.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_6.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_6.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -309,12 +308,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_frequency.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_frequency.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_frequency.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_frequency.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -370,12 +369,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_9.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_9.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_9.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_9.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -439,7 +438,7 @@ fi
 
 set +e
 
-$SIMPLE_HTTP_CLIENT >> client_ensemble.log 2>&1
+python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_ensemble.log 2>&1
 if [ $? -ne 0 ]; then
     echo -e "\n***\n*** Test Failed\n***"
     RET=1
@@ -461,15 +460,15 @@ if [ `grep -c "COMPUTE_INPUT_END" summary_ensemble.log` != "7" ]; then
 fi
 
 for trace_str in \
-        "{\"id\":1,\"model_name\":\"simple\",\"model_version\":1}" \
-        "{\"id\":2,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":1}" \
-        "{\"id\":3,\"model_name\":\"fan_${MODELBASE}\",\"model_version\":1,\"parent_id\":1}" \
-        "{\"id\":4,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":5,\"model_name\":\"${MODELBASE}\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":6,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":7,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":8,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":1}" \
-        "{\"id\":9,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":1}" ; do
+        "{\"id\":1,\"model_name\":\"simple\",\"model_version\":1,\"request_id\":\"1\"}" \
+        "{\"id\":2,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" \
+        "{\"id\":3,\"model_name\":\"fan_${MODELBASE}\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" \
+        "{\"id\":4,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":5,\"model_name\":\"${MODELBASE}\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":6,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":7,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":8,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" \
+        "{\"id\":9,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" ; do
     if [ `grep -c ${trace_str} trace_ensemble.log` != "1" ]; then
         echo -e "Ensemble trace log expects trace: ${trace_str}"
         RET=1
@@ -485,12 +484,6 @@ fi
 set -e
 
 
-if [ $RET -eq 0 ]; then
-    echo -e "\n***\n*** Test Passed\n***"
-else
-    echo -e "\n***\n*** Test FAILED\n***"
-fi
-
 # trace-rate == 1, trace-level=TIMESTAMPS, trace-level=TENSORS
 SERVER_ARGS="--http-thread-count=1 --trace-file=trace_ensemble_tensor.log \
              --trace-level=TIMESTAMPS --trace-level=TENSORS --trace-rate=1 --model-repository=$MODELSDIR"
@@ -504,7 +497,7 @@ fi
 
 set +e
 
-$SIMPLE_HTTP_CLIENT >> client_ensemble_tensor.log 2>&1
+python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_ensemble_tensor.log 2>&1
 if [ $? -ne 0 ]; then
     echo -e "\n***\n*** Test Failed\n***"
     RET=1
@@ -525,15 +518,15 @@ if [ `grep -c "COMPUTE_INPUT_END" summary_ensemble_tensor.log` != "7" ]; then
     RET=1
 fi
 for trace_str in \
-        "{\"id\":1,\"model_name\":\"simple\",\"model_version\":1}" \
-        "{\"id\":2,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":1}" \
-        "{\"id\":3,\"model_name\":\"fan_${MODELBASE}\",\"model_version\":1,\"parent_id\":1}" \
-        "{\"id\":4,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":5,\"model_name\":\"${MODELBASE}\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":6,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":7,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":8,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":1}" \
-        "{\"id\":9,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":1}" ; do
+        "{\"id\":1,\"model_name\":\"simple\",\"model_version\":1,\"request_id\":\"1\"}" \
+        "{\"id\":2,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" \
+        "{\"id\":3,\"model_name\":\"fan_${MODELBASE}\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" \
+        "{\"id\":4,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":5,\"model_name\":\"${MODELBASE}\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":6,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":7,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":8,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" \
+        "{\"id\":9,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" ; do
     if [ `grep -c ${trace_str} trace_ensemble_tensor.log` != "1" ]; then
         echo -e "Ensemble trace tensors log expects trace: ${trace_str}"
         RET=1
@@ -577,4 +570,59 @@ else
 fi
 
 
+# check deprecation warnings
+SERVER_ARGS=" --trace-file=/tmp/trace.json --trace-rate=100 --trace-level=TIMESTAMPS \
+              --trace-log-frequency=50 --trace-count=100 --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_trace_config_flag.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+
+if [ `grep -c "Warning: '--trace-file' has been deprecated" $SERVER_LOG` != "1" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+if [ `grep -c "Warning: '--trace-rate' has been deprecated" $SERVER_LOG` != "1" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+if [ `grep -c "Warning: '--trace-level' has been deprecated" $SERVER_LOG` != "1" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+if [ `grep -c "Warning: '--trace-log-frequency' has been deprecated" $SERVER_LOG` != "1" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+if [ `grep -c "Warning: '--trace-count' has been deprecated" $SERVER_LOG` != "1" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+if [ $RET -eq 0 ]; then
+    echo -e "\n***\n*** Test Passed\n***"
+else
+    echo -e "\n***\n*** Test FAILED\n***"
+fi
+
+
 exit $RET
diff --git a/qa/L0_cmdline_trace/trace_client.py b/qa/L0_cmdline_trace/trace_client.py
new file mode 100755
index 0000000000..4d59579d7c
--- /dev/null
+++ b/qa/L0_cmdline_trace/trace_client.py
@@ -0,0 +1,79 @@
+#!/usr/bin/env python
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import argparse
+import sys
+
+import numpy as np
+import tritonclient.grpc as grpcclient
+import tritonclient.http as httpclient
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-u",
+        "--url",
+        type=str,
+        required=False,
+        default="localhost:8001",
+        help="Inference server URL. Default is localhost:8001.",
+    )
+    parser.add_argument("-i", "--protocol", type=str, required=True)
+    FLAGS = parser.parse_args()
+
+    if FLAGS.protocol == "grpc":
+        client_type = grpcclient
+    else:
+        client_type = httpclient
+
+    try:
+        triton_client = client_type.InferenceServerClient(url=FLAGS.url)
+    except Exception as e:
+        print("channel creation failed: " + str(e))
+        sys.exit(1)
+
+    model_name = "simple"
+
+    # Infer
+    inputs = []
+    outputs = []
+    inputs.append(client_type.InferInput("INPUT0", [1, 16], "INT32"))
+    inputs.append(client_type.InferInput("INPUT1", [1, 16], "INT32"))
+
+    input0_data = np.arange(start=0, stop=16, dtype=np.int32)
+    input0_data = np.expand_dims(input0_data, axis=0)
+    input1_data = np.ones(shape=(1, 16), dtype=np.int32)
+
+    inputs[0].set_data_from_numpy(input0_data)
+    inputs[1].set_data_from_numpy(input1_data)
+
+    outputs.append(client_type.InferRequestedOutput("OUTPUT0"))
+    outputs.append(client_type.InferRequestedOutput("OUTPUT1"))
+
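+    # request_id is set explicitly because test.sh checks that it appears in
+    # the generated trace entries.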
+    triton_client.infer(
+        model_name=model_name, inputs=inputs, outputs=outputs, request_id="1"
+    )
diff --git a/qa/L0_config_json/max_priority_level.pbtxt b/qa/L0_config_json/max_priority_level.pbtxt
new file mode 100644
index 0000000000..f71f08d236
--- /dev/null
+++ b/qa/L0_config_json/max_priority_level.pbtxt
@@ -0,0 +1,62 @@
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+name: "max_priority_level"
+backend: "identity"
+max_batch_size: 1
+input [
+  {
+    name: "INPUT0"
+    data_type: TYPE_FP32
+    dims: [ 1 ]
+  }
+]
+output [
+  {
+    name: "OUTPUT0"
+    data_type: TYPE_FP32
+    dims: [ 1 ]
+  }
+]
+
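+# The values below sit at numeric boundaries (uint64 max, uint32 max, and
+# uint32 max + 1) so test.sh can check that they survive the model-config
+# JSON round trip unchanged.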
+dynamic_batching:
+{
+    # Max uint64
+    priority_levels: 18446744073709551615
+    # Max uint32
+    default_priority_level: 4294967295
+    # Max uint32 + 1
+    priority_queue_policy: [
+        {
+            key: 4294967296
+            value: {
+                timeout_action: REJECT
+                default_timeout_microseconds: 18446744073709551615
+                allow_timeout_override: true
+                max_queue_size: 10
+            }
+        }
+    ]
+}
\ No newline at end of file
diff --git a/qa/L0_config_json/test.sh b/qa/L0_config_json/test.sh
index 0b7f29e05b..b1016b806b 100755
--- a/qa/L0_config_json/test.sh
+++ b/qa/L0_config_json/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -373,6 +373,52 @@ fi
 kill $SERVER_PID
 wait $SERVER_PID
 
+# Test max_priority_level
+TRIAL=max_priority_level
+
+rm -fr models && mkdir models
+mkdir -p models/max_priority_level/1 && cp max_priority_level.pbtxt models/max_priority_level/config.pbtxt
+
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+code=`curl -s -w %{http_code} -o ./$TRIAL.out localhost:8000/v2/models/max_priority_level/config`
+set -e
+if [ "$code" != "200" ]; then
+    cat $TRIAL.out
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+declare -A expected_values
+
+MAX_UINT64=18446744073709551615
+MAX_UINT32=4294967295
+MAX_UINT32_PLUS_1=4294967296
+
+expected_values["priority_levels"]=$MAX_UINT64
+expected_values["default_priority_level"]=$MAX_UINT32
+expected_values[$MAX_UINT32_PLUS_1]=\{\"timeout_action\":\"REJECT\",\"default_timeout_microseconds\":18446744073709551615,\"allow_timeout_override\":true,\"max_queue_size\":10\}
+expected_values["default_timeout_microseconds"]=$MAX_UINT64
+
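+# Each expected key/value pair must appear exactly once in the JSON config
+# returned by the server.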
+for key in "${!expected_values[@]}"; do
+    value=${expected_values[$key]}
+    matches=`grep -o "\"$key\":$value" $TRIAL.out | wc -l`
+    if [ $matches -ne 1 ]; then
+        cat $TRIAL.out
+        echo -e "\n***\n*** Expected 1 $key == $value, got $matches\n***"
+        RET=1
+    fi
+done
+
+kill $SERVER_PID
+wait $SERVER_PID
+
 if [ $RET -eq 0 ]; then
     echo -e "\n***\n*** Test Passed\n***"
 else
diff --git a/qa/L0_cuda_graph/test.sh b/qa/L0_cuda_graph/test.sh
old mode 100644
new mode 100755
index 58a796eb4d..9388dba77d
--- a/qa/L0_cuda_graph/test.sh
+++ b/qa/L0_cuda_graph/test.sh
@@ -49,7 +49,7 @@ rm -rf ${DATADIR}
 mkdir -p ${DATADIR}
 
 SERVER=/opt/tritonserver/bin/tritonserver
-SERVER_ARGS="--log-verbose=1 --model-repository=$DATADIR"
+SERVER_ARGS="--log-verbose=1 --model-repository=$DATADIR --strict-model-config=true"
 SERVER_LOG="./inference_server.log"
 source ../common/util.sh
 
@@ -118,6 +118,7 @@ wait $SERVER_PID
 rm -rf ${DATADIR} && mkdir -p ${DATADIR}
 cp -r /data/inferenceserver/${REPO_VERSION}/qa_variable_model_repository/plan_float32_float32_float32 ${DATADIR}/
 
+SERVER_ARGS="--log-verbose=1 --model-repository=$DATADIR --strict-model-config=true"
 CLIENT_LOG="./dynamic_shape.client.log"
 SERVER_LOG="./dynamic_shape.inference_server.log"
 sed -i "s/profile:.*/profile: [\"0\"]/" ${DATADIR}/plan_float32_float32_float32/config.pbtxt
@@ -167,6 +168,7 @@ cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/plan_float32_flo
 # Make sure only one version is present
 rm -rf ${DATADIR}/plan_float32_float32_float32/3
 
+SERVER_ARGS="--log-verbose=1 --model-repository=$DATADIR"
 CLIENT_LOG="./range_fixed_shape.client.log"
 SERVER_LOG="./range_fixed_shape.inference_server.log"
 echo "optimization { \
@@ -285,6 +287,53 @@ set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
+# TrtCudaGraphTest.test_nobatch_fixed_shape
+rm -rf ${DATADIR} && mkdir -p ${DATADIR}
+cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/plan_nobatch_float32_float32_float32 ${DATADIR}/
+# Make sure only one version is present
+rm -rf ${DATADIR}/plan_nobatch_float32_float32_float32/2 ${DATADIR}/plan_nobatch_float32_float32_float32/3
+
+CLIENT_LOG="./nobatch_fixed_shape.client.log"
+SERVER_LOG="./nobatch_fixed_shape.inference_server.log"
+echo "optimization { cuda { graphs: true } }" >> ${DATADIR}/plan_nobatch_float32_float32_float32/config.pbtxt
+
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $TRT_CUDA_GRAPH_TEST TrtCudaGraphTest.test_nobatch_fixed_shape plan_nobatch>>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Test Failed\n***"
+    cat $CLIENT_LOG
+    RET=1
+else
+    check_test_results $TEST_RESULT_FILE 1
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Result Verification Failed\n***"
+        RET=1
+    fi
+fi
+set -e
+
+set +e
+if [ `grep -c "Context with profile default \[0\] is launching CUDA graph " $SERVER_LOG` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected only one execution with CUDA graph\n***"
+    RET=1
+fi
+
+if [ `grep -c "captured CUDA graph for" $SERVER_LOG` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected 1 CUDA graph to be captured\n***"
+    RET=1
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
 
 if [ $RET -eq 0 ]; then
   echo -e "\n***\n*** Test Passed\n***"
diff --git a/qa/L0_cuda_graph/trt_cuda_graph_test.py b/qa/L0_cuda_graph/trt_cuda_graph_test.py
old mode 100644
new mode 100755
index 851ae90ed2..a7f9f3be98
--- a/qa/L0_cuda_graph/trt_cuda_graph_test.py
+++ b/qa/L0_cuda_graph/trt_cuda_graph_test.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,39 +27,50 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
 import unittest
-import numpy as np
+
 import infer_util as iu
+import numpy as np
 import test_util as tu
 from tritonclientutils import *
 
 
 class TrtCudaGraphTest(tu.TestResultCollector):
+    MODELNAME = "plan"
 
     def setUp(self):
         self.dtype_ = np.float32
         self.dtype_str_ = "FP32"
-        self.model_name_ = 'plan'
+        self.model_name_ = self.MODELNAME
 
     def _check_infer(self, tensor_shape, batch_size=1):
         try:
-            iu.infer_exact(self,
-                           self.model_name_, (batch_size,) + tensor_shape,
-                           batch_size,
-                           self.dtype_,
-                           self.dtype_,
-                           self.dtype_,
-                           model_version=1,
-                           use_http_json_tensors=False,
-                           use_grpc=False,
-                           use_streaming=False)
+            if batch_size:
+                full_shape = (batch_size,) + tensor_shape
+            else:
+                full_shape = tensor_shape
+            iu.infer_exact(
+                self,
+                self.model_name_,
+                full_shape,
+                batch_size,
+                self.dtype_,
+                self.dtype_,
+                self.dtype_,
+                model_version=1,
+                use_http_json_tensors=False,
+                use_grpc=False,
+                use_streaming=False,
+            )
         except InferenceServerException as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
     def _erroneous_infer(self, tensor_shape, batch_size):
         import tritonhttpclient
+
         item_size = batch_size
         for dim in tensor_shape:
             item_size *= dim
@@ -68,30 +81,38 @@ def _erroneous_infer(self, tensor_shape, batch_size):
 
         inputs = []
         inputs.append(
-            tritonhttpclient.InferInput('INPUT0', full_shape, self.dtype_str_))
+            tritonhttpclient.InferInput("INPUT0", full_shape, self.dtype_str_)
+        )
         inputs[-1].set_data_from_numpy(input_np)
         inputs.append(
-            tritonhttpclient.InferInput('INPUT1', full_shape, self.dtype_str_))
+            tritonhttpclient.InferInput("INPUT1", full_shape, self.dtype_str_)
+        )
         inputs[-1].set_data_from_numpy(input_np)
         outputs = []
         outputs.append(
-            tritonhttpclient.InferRequestedOutput('OUTPUT0', binary_data=True))
+            tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True)
+        )
         outputs.append(
-            tritonhttpclient.InferRequestedOutput('OUTPUT1', binary_data=True))
+            tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True)
+        )
 
-        model_name = tu.get_model_name(self.model_name_, self.dtype_,
-                                       self.dtype_, self.dtype_)
+        model_name = tu.get_model_name(
+            self.model_name_, self.dtype_, self.dtype_, self.dtype_
+        )
         results = tritonhttpclient.InferenceServerClient(
-            "localhost:8000", verbose=True).infer(model_name=model_name,
-                                                  inputs=inputs,
-                                                  outputs=outputs)
+            "localhost:8000", verbose=True
+        ).infer(model_name=model_name, inputs=inputs, outputs=outputs)
         # Validate the results by comparing with precomputed values.
-        output0_np = results.as_numpy('OUTPUT0')
-        output1_np = results.as_numpy('OUTPUT1')
-        self.assertFalse(np.array_equal(output0_np, expected_output0_np),
-                         "expects OUTPUT0 is not correct")
-        self.assertFalse(np.array_equal(output1_np, expected_output1_np),
-                         "expects OUTPUT1 is not correct")
+        output0_np = results.as_numpy("OUTPUT0")
+        output1_np = results.as_numpy("OUTPUT1")
+        self.assertFalse(
+            np.array_equal(output0_np, expected_output0_np),
+            "expects OUTPUT0 is not correct",
+        )
+        self.assertFalse(
+            np.array_equal(output1_np, expected_output1_np),
+            "expects OUTPUT1 is not correct",
+        )
 
     def test_fixed_shape(self):
         tensor_shape = (16,)
@@ -131,6 +152,12 @@ def test_range_dynamic_shape(self):
         self._check_infer((16,), 8)
         self._check_infer((30,), 4)
 
+    def test_nobatch_fixed_shape(self):
+        self._check_infer((16,), 0)
+
+
+if __name__ == "__main__":
+    if len(sys.argv) > 2:
+        TrtCudaGraphTest.MODELNAME = sys.argv.pop()
 
-if __name__ == '__main__':
     unittest.main()
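
The test above now reads the target model base name from an optional trailing command-line argument (test.sh passes plan_nobatch for the new nobatch case) and treats batch_size 0 as a model with no batch dimension. Below is a minimal, self-contained sketch of that pattern; the class and file names are illustrative and not taken from the repository.

#!/usr/bin/env python3
# Minimal sketch of the pattern used above: a class-level default model name
# that an optional trailing command-line argument overrides before unittest
# runs, plus batch_size == 0 meaning "no batch dimension". Illustrative only.
import sys
import unittest


class ExampleTrtTest(unittest.TestCase):
    MODELNAME = "plan"  # default; overridden from the command line if given

    @staticmethod
    def full_shape(tensor_shape, batch_size):
        # batch_size == 0 means the model has no batch dimension.
        return (batch_size,) + tensor_shape if batch_size else tensor_shape

    def test_fixed_shape(self):
        self.assertEqual(self.full_shape((16,), 8), (8, 16))

    def test_nobatch_fixed_shape(self):
        self.assertEqual(self.full_shape((16,), 0), (16,))


if __name__ == "__main__":
    # e.g. python example_trt_test.py ExampleTrtTest.test_nobatch_fixed_shape plan_nobatch
    if len(sys.argv) > 2:
        ExampleTrtTest.MODELNAME = sys.argv.pop()
    unittest.main()
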
diff --git a/qa/L0_cuda_shared_memory/cuda_shared_memory_test.py b/qa/L0_cuda_shared_memory/cuda_shared_memory_test.py
old mode 100644
new mode 100755
index 4bff4eba75..87fb7c1d3c
--- a/qa/L0_cuda_shared_memory/cuda_shared_memory_test.py
+++ b/qa/L0_cuda_shared_memory/cuda_shared_memory_test.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,13 +27,14 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-import numpy as np
-import unittest
 import os
-import test_util as tu
+import unittest
 
+import numpy as np
+import test_util as tu
 import tritongrpcclient as grpcclient
 import tritonhttpclient as httpclient
 import tritonshmutils.cuda_shared_memory as cshm
@@ -39,16 +42,13 @@
 
 
 class CudaSharedMemoryTest(tu.TestResultCollector):
-
     def test_invalid_create_shm(self):
         # Raises error since tried to create invalid cuda shared memory region
         try:
-            shm_op0_handle = cshm.create_shared_memory_region(
-                "dummy_data", -1, 0)
+            shm_op0_handle = cshm.create_shared_memory_region("dummy_data", -1, 0)
             cshm.destroy_shared_memory_region(shm_op0_handle)
         except Exception as ex:
-            self.assertEqual(str(ex),
-                             "unable to create cuda shared memory handle")
+            self.assertEqual(str(ex), "unable to create cuda shared memory handle")
 
     def test_valid_create_set_register(self):
         # Create a valid cuda shared memory region, fill data in it and register
@@ -57,10 +57,12 @@ def test_valid_create_set_register(self):
         else:
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
         shm_op0_handle = cshm.create_shared_memory_region("dummy_data", 8, 0)
-        cshm.set_shared_memory_region(shm_op0_handle,
-                                      [np.array([1, 2], dtype=np.float32)])
+        cshm.set_shared_memory_region(
+            shm_op0_handle, [np.array([1, 2], dtype=np.float32)]
+        )
         triton_client.register_cuda_shared_memory(
-            "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8)
+            "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8
+        )
         shm_status = triton_client.get_cuda_shared_memory_status()
         if _protocol == "http":
             self.assertEqual(len(shm_status), 1)
@@ -91,7 +93,8 @@ def test_unregister_after_register(self):
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
         shm_op0_handle = cshm.create_shared_memory_region("dummy_data", 8, 0)
         triton_client.register_cuda_shared_memory(
-            "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8)
+            "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8
+        )
         triton_client.unregister_cuda_shared_memory("dummy_data")
         shm_status = triton_client.get_cuda_shared_memory_status()
         if _protocol == "http":
@@ -108,13 +111,16 @@ def test_reregister_after_register(self):
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
         shm_op0_handle = cshm.create_shared_memory_region("dummy_data", 8, 0)
         triton_client.register_cuda_shared_memory(
-            "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8)
+            "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8
+        )
         try:
             triton_client.register_cuda_shared_memory(
-                "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8)
+                "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8
+            )
         except Exception as ex:
             self.assertIn(
-                "shared memory region 'dummy_data' already in manager", str(ex))
+                "shared memory region 'dummy_data' already in manager", str(ex)
+            )
         shm_status = triton_client.get_cuda_shared_memory_status()
         if _protocol == "http":
             self.assertEqual(len(shm_status), 1)
@@ -137,27 +143,33 @@ def _configure_sever(self):
         else:
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
         triton_client.register_cuda_shared_memory(
-            "input0_data", cshm.get_raw_handle(shm_ip0_handle), 0, 64)
+            "input0_data", cshm.get_raw_handle(shm_ip0_handle), 0, 64
+        )
         triton_client.register_cuda_shared_memory(
-            "input1_data", cshm.get_raw_handle(shm_ip1_handle), 0, 64)
+            "input1_data", cshm.get_raw_handle(shm_ip1_handle), 0, 64
+        )
         triton_client.register_cuda_shared_memory(
-            "output0_data", cshm.get_raw_handle(shm_op0_handle), 0, 64)
+            "output0_data", cshm.get_raw_handle(shm_op0_handle), 0, 64
+        )
         triton_client.register_cuda_shared_memory(
-            "output1_data", cshm.get_raw_handle(shm_op1_handle), 0, 64)
+            "output1_data", cshm.get_raw_handle(shm_op1_handle), 0, 64
+        )
         return [shm_ip0_handle, shm_ip1_handle, shm_op0_handle, shm_op1_handle]
 
     def _cleanup_server(self, shm_handles):
         for shm_handle in shm_handles:
             cshm.destroy_shared_memory_region(shm_handle)
 
-    def _basic_inference(self,
-                         shm_ip0_handle,
-                         shm_ip1_handle,
-                         shm_op0_handle,
-                         shm_op1_handle,
-                         error_msg,
-                         big_shm_name="",
-                         big_shm_size=64):
+    def _basic_inference(
+        self,
+        shm_ip0_handle,
+        shm_ip1_handle,
+        shm_op0_handle,
+        shm_op1_handle,
+        error_msg,
+        big_shm_name="",
+        big_shm_size=64,
+    ):
         input0_data = np.arange(start=0, stop=16, dtype=np.int32)
         input1_data = np.ones(shape=16, dtype=np.int32)
         inputs = []
@@ -166,16 +178,16 @@ def _basic_inference(self,
             triton_client = httpclient.InferenceServerClient(_url, verbose=True)
             inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32"))
             inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32"))
+            outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True))
             outputs.append(
-                httpclient.InferRequestedOutput('OUTPUT0', binary_data=True))
-            outputs.append(
-                httpclient.InferRequestedOutput('OUTPUT1', binary_data=False))
+                httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)
+            )
         else:
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
             inputs.append(grpcclient.InferInput("INPUT0", [1, 16], "INT32"))
             inputs.append(grpcclient.InferInput("INPUT1", [1, 16], "INT32"))
-            outputs.append(grpcclient.InferRequestedOutput('OUTPUT0'))
-            outputs.append(grpcclient.InferRequestedOutput('OUTPUT1'))
+            outputs.append(grpcclient.InferRequestedOutput("OUTPUT0"))
+            outputs.append(grpcclient.InferRequestedOutput("OUTPUT1"))
         inputs[0].set_shared_memory("input0_data", 64)
         if type(shm_ip1_handle) == np.array:
             inputs[1].set_data_from_numpy(input0_data, binary_data=True)
@@ -187,22 +199,21 @@ def _basic_inference(self,
         outputs[1].set_shared_memory("output1_data", 64)
 
         try:
-            results = triton_client.infer("simple",
-                                          inputs,
-                                          model_version="",
-                                          outputs=outputs)
-            output = results.get_output('OUTPUT0')
+            results = triton_client.infer(
+                "simple", inputs, model_version="", outputs=outputs
+            )
+            output = results.get_output("OUTPUT0")
             if _protocol == "http":
-                output_datatype = output['datatype']
-                output_shape = output['shape']
+                output_datatype = output["datatype"]
+                output_shape = output["shape"]
             else:
                 output_datatype = output.datatype
                 output_shape = output.shape
             output_dtype = triton_to_np_dtype(output_datatype)
-            output_data = cshm.get_contents_as_numpy(shm_op0_handle,
-                                                     output_dtype, output_shape)
-            self.assertTrue(
-                (output_data[0] == (input0_data + input1_data)).all())
+            output_data = cshm.get_contents_as_numpy(
+                shm_op0_handle, output_dtype, output_shape
+            )
+            self.assertTrue((output_data[0] == (input0_data + input1_data)).all())
         except Exception as ex:
             error_msg.append(str(ex))
 
@@ -210,8 +221,9 @@ def test_unregister_after_inference(self):
         # Unregister after inference
         error_msg = []
         shm_handles = self._configure_sever()
-        self._basic_inference(shm_handles[0], shm_handles[1], shm_handles[2],
-                              shm_handles[3], error_msg)
+        self._basic_inference(
+            shm_handles[0], shm_handles[1], shm_handles[2], shm_handles[3], error_msg
+        )
         if len(error_msg) > 0:
             raise Exception(str(error_msg))
         if _protocol == "http":
@@ -234,13 +246,15 @@ def test_register_after_inference(self):
             triton_client = httpclient.InferenceServerClient(_url, verbose=True)
         else:
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
-        self._basic_inference(shm_handles[0], shm_handles[1], shm_handles[2],
-                              shm_handles[3], error_msg)
+        self._basic_inference(
+            shm_handles[0], shm_handles[1], shm_handles[2], shm_handles[3], error_msg
+        )
         if len(error_msg) > 0:
             raise Exception(str(error_msg))
         shm_ip2_handle = cshm.create_shared_memory_region("input2_data", 64, 0)
         triton_client.register_cuda_shared_memory(
-            "input2_data", cshm.get_raw_handle(shm_ip2_handle), 0, 64)
+            "input2_data", cshm.get_raw_handle(shm_ip2_handle), 0, 64
+        )
         shm_status = triton_client.get_cuda_shared_memory_status()
         if _protocol == "http":
             self.assertEqual(len(shm_status), 5)
@@ -259,13 +273,22 @@ def test_too_big_shm(self):
         else:
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
         triton_client.register_cuda_shared_memory(
-            "input2_data", cshm.get_raw_handle(shm_ip2_handle), 0, 128)
-        self._basic_inference(shm_handles[0], shm_ip2_handle, shm_handles[2],
-                              shm_handles[3], error_msg, "input2_data", 128)
+            "input2_data", cshm.get_raw_handle(shm_ip2_handle), 0, 128
+        )
+        self._basic_inference(
+            shm_handles[0],
+            shm_ip2_handle,
+            shm_handles[2],
+            shm_handles[3],
+            error_msg,
+            "input2_data",
+            128,
+        )
         if len(error_msg) > 0:
             self.assertIn(
                 "unexpected total byte size 128 for input 'INPUT1', expecting 64",
-                error_msg[-1])
+                error_msg[-1],
+            )
         shm_handles.append(shm_ip2_handle)
         self._cleanup_server(shm_handles)
 
@@ -274,8 +297,9 @@ def test_mixed_raw_shm(self):
         error_msg = []
         shm_handles = self._configure_sever()
         input1_data = np.ones(shape=16, dtype=np.int32)
-        self._basic_inference(shm_handles[0], [input1_data], shm_handles[2],
-                              shm_handles[3], error_msg)
+        self._basic_inference(
+            shm_handles[0], [input1_data], shm_handles[2], shm_handles[3], error_msg
+        )
         if len(error_msg) > 0:
             raise Exception(error_msg[-1])
         self._cleanup_server(shm_handles)
@@ -301,8 +325,8 @@ def test_unregisterall(self):
         self._cleanup_server(shm_handles)
 
 
-if __name__ == '__main__':
-    _protocol = os.environ.get('CLIENT_TYPE', "http")
+if __name__ == "__main__":
+    _protocol = os.environ.get("CLIENT_TYPE", "http")
     if _protocol == "http":
         _url = "localhost:8000"
     else:
diff --git a/qa/L0_cuda_shared_memory/test.sh b/qa/L0_cuda_shared_memory/test.sh
old mode 100644
new mode 100755
index 2e1120c9b1..b011244174
--- a/qa/L0_cuda_shared_memory/test.sh
+++ b/qa/L0_cuda_shared_memory/test.sh
@@ -50,7 +50,7 @@ for i in \
         test_unregisterall; do
     for client_type in http grpc; do
         SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1"
-        SERVER_LOG="./$i.$client_type.serverlog"
+        SERVER_LOG="./$i.$client_type.server.log"
         run_server
         if [ "$SERVER_PID" == "0" ]; then
             echo -e "\n***\n*** Failed to start $SERVER\n***"
diff --git a/qa/L0_custom_ops/cuda_op_test.py b/qa/L0_custom_ops/cuda_op_test.py
old mode 100644
new mode 100755
index d4389d67ad..896ed2adf0
--- a/qa/L0_custom_ops/cuda_op_test.py
+++ b/qa/L0_custom_ops/cuda_op_test.py
@@ -27,47 +27,50 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import argparse
-import numpy as np
 import sys
 from builtins import range
+
+import numpy as np
 import tritongrpcclient as grpcclient
 import tritonhttpclient as httpclient
 from tritonclientutils import np_to_triton_dtype
 
 FLAGS = None
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-v',
-                        '--verbose',
-                        action="store_true",
-                        required=False,
-                        default=False,
-                        help='Enable verbose output')
-    parser.add_argument('-u',
-                        '--url',
-                        type=str,
-                        required=False,
-                        default='localhost:8000',
-                        help='Inference server URL. Default is localhost:8000.')
     parser.add_argument(
-        '-i',
-        '--protocol',
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-u",
+        "--url",
         type=str,
         required=False,
-        default='http',
-        help='Protocol ("http"/"grpc") used to ' +
-        'communicate with inference service. Default is "http".')
-    parser.add_argument('-m',
-                        '--model',
-                        type=str,
-                        required=True,
-                        help='Name of model.')
+        default="localhost:8000",
+        help="Inference server URL. Default is localhost:8000.",
+    )
+    parser.add_argument(
+        "-i",
+        "--protocol",
+        type=str,
+        required=False,
+        default="http",
+        help='Protocol ("http"/"grpc") used to '
+        + 'communicate with inference service. Default is "http".',
+    )
+    parser.add_argument("-m", "--model", type=str, required=True, help="Name of model.")
 
     FLAGS = parser.parse_args()
     if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"):
-        print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format(
-            FLAGS.protocol))
+        print(
+            'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol)
+        )
         exit(1)
 
     client_util = httpclient if FLAGS.protocol == "http" else grpcclient
@@ -84,21 +87,22 @@
     input_data = np.arange(start=42, stop=42 + elements, dtype=np.int32)
 
     inputs = [
-        client_util.InferInput("in", input_data.shape,
-                               np_to_triton_dtype(input_data.dtype))
+        client_util.InferInput(
+            "in", input_data.shape, np_to_triton_dtype(input_data.dtype)
+        )
     ]
     inputs[0].set_data_from_numpy(input_data)
 
     results = client.infer(model_name, inputs)
-    output_data = results.as_numpy('out')
+    output_data = results.as_numpy("out")
     if output_data is None:
         print("error: expected 'out'")
         sys.exit(1)
 
     for i in range(elements):
         print(
-            str(i) + ": input " + str(input_data[i]) + ", output " +
-            str(output_data[i]))
+            str(i) + ": input " + str(input_data[i]) + ", output " + str(output_data[i])
+        )
         if output_data[i] != (input_data[i] + 1):
             print("error: incorrect value")
             sys.exit(1)
diff --git a/qa/L0_custom_ops/mod_op_test.py b/qa/L0_custom_ops/mod_op_test.py
old mode 100644
new mode 100755
index 62edd1e289..14855f7c40
--- a/qa/L0_custom_ops/mod_op_test.py
+++ b/qa/L0_custom_ops/mod_op_test.py
@@ -27,47 +27,50 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import argparse
-import numpy as np
 import sys
 from builtins import range
+
+import numpy as np
 import tritongrpcclient as grpcclient
 import tritonhttpclient as httpclient
 from tritonclientutils import np_to_triton_dtype
 
 FLAGS = None
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-v',
-                        '--verbose',
-                        action="store_true",
-                        required=False,
-                        default=False,
-                        help='Enable verbose output')
-    parser.add_argument('-u',
-                        '--url',
-                        type=str,
-                        required=False,
-                        default='localhost:8000',
-                        help='Inference server URL. Default is localhost:8000.')
     parser.add_argument(
-        '-i',
-        '--protocol',
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-u",
+        "--url",
         type=str,
         required=False,
-        default='http',
-        help='Protocol ("http"/"grpc") used to ' +
-        'communicate with inference service. Default is "http".')
-    parser.add_argument('-m',
-                        '--model',
-                        type=str,
-                        required=True,
-                        help='Name of model.')
+        default="localhost:8000",
+        help="Inference server URL. Default is localhost:8000.",
+    )
+    parser.add_argument(
+        "-i",
+        "--protocol",
+        type=str,
+        required=False,
+        default="http",
+        help='Protocol ("http"/"grpc") used to '
+        + 'communicate with inference service. Default is "http".',
+    )
+    parser.add_argument("-m", "--model", type=str, required=True, help="Name of model.")
 
     FLAGS = parser.parse_args()
     if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"):
-        print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format(
-            FLAGS.protocol))
+        print(
+            'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol)
+        )
         exit(1)
 
     client_util = httpclient if FLAGS.protocol == "http" else grpcclient
@@ -87,22 +90,32 @@
     inputs = []
     for i in range(len(input_data)):
         inputs.append(
-            client_util.InferInput("INPUT__{}".format(i), input_data[0].shape,
-                                   np_to_triton_dtype(input_data[0].dtype)))
+            client_util.InferInput(
+                "INPUT__{}".format(i),
+                input_data[0].shape,
+                np_to_triton_dtype(input_data[0].dtype),
+            )
+        )
         inputs[i].set_data_from_numpy(input_data[i])
 
     results = client.infer(model_name, inputs)
 
     # We expect 1 result of size 10 with alternating 1 and 0.
-    output_data = results.as_numpy('OUTPUT__0')
+    output_data = results.as_numpy("OUTPUT__0")
     if output_data is None:
         print("error: expected 'OUTPUT__0'")
         sys.exit(1)
 
     for i in range(elements):
         print(
-            str(i) + ": " + str(input_data[0][i]) + " % " +
-            str(input_data[1][i]) + " = " + str(output_data[i]))
-        if ((input_data[0][i] % input_data[1][i]) != output_data[i]):
+            str(i)
+            + ": "
+            + str(input_data[0][i])
+            + " % "
+            + str(input_data[1][i])
+            + " = "
+            + str(output_data[i])
+        )
+        if (input_data[0][i] % input_data[1][i]) != output_data[i]:
             print("error: incorrect value")
             sys.exit(1)
diff --git a/qa/L0_custom_ops/onnx_op_test.py b/qa/L0_custom_ops/onnx_op_test.py
old mode 100644
new mode 100755
index 6a3d5ebb53..9b246c8e31
--- a/qa/L0_custom_ops/onnx_op_test.py
+++ b/qa/L0_custom_ops/onnx_op_test.py
@@ -27,47 +27,50 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import argparse
-import numpy as np
 import sys
 from builtins import range
+
+import numpy as np
 import tritongrpcclient as grpcclient
 import tritonhttpclient as httpclient
 from tritonclientutils import np_to_triton_dtype
 
 FLAGS = None
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-v',
-                        '--verbose',
-                        action="store_true",
-                        required=False,
-                        default=False,
-                        help='Enable verbose output')
-    parser.add_argument('-u',
-                        '--url',
-                        type=str,
-                        required=False,
-                        default='localhost:8000',
-                        help='Inference server URL. Default is localhost:8000.')
     parser.add_argument(
-        '-i',
-        '--protocol',
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-u",
+        "--url",
+        type=str,
+        required=False,
+        default="localhost:8000",
+        help="Inference server URL. Default is localhost:8000.",
+    )
+    parser.add_argument(
+        "-i",
+        "--protocol",
         type=str,
         required=False,
-        default='http',
-        help='Protocol ("http"/"grpc") used to ' +
-        'communicate with inference service. Default is "http".')
-    parser.add_argument('-m',
-                        '--model',
-                        type=str,
-                        required=True,
-                        help='Name of model.')
+        default="http",
+        help='Protocol ("http"/"grpc") used to '
+        + 'communicate with inference service. Default is "http".',
+    )
+    parser.add_argument("-m", "--model", type=str, required=True, help="Name of model.")
 
     FLAGS = parser.parse_args()
     if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"):
-        print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format(
-            FLAGS.protocol))
+        print(
+            'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol)
+        )
         exit(1)
 
     client_util = httpclient if FLAGS.protocol == "http" else grpcclient
@@ -88,14 +91,16 @@
     inputs = []
     for i in range(len(input_data)):
         inputs.append(
-            client_util.InferInput("input_{}".format(i + 1), shape,
-                                   np_to_triton_dtype(dtype)))
+            client_util.InferInput(
+                "input_{}".format(i + 1), shape, np_to_triton_dtype(dtype)
+            )
+        )
         inputs[i].set_data_from_numpy(input_data[i])
 
     results = client.infer(model_name, inputs)
 
     # We expect 1 result of size 10 with alternating 1 and 0.
-    output_data = results.as_numpy('output')
+    output_data = results.as_numpy("output")
     if output_data is None:
         print("error: expected 'output'")
         sys.exit(1)
@@ -103,9 +108,12 @@
     for i in range(3):
         for j in range(5):
             print(
-                str(input_data[0][i][j]) + " + " + str(input_data[1][i][j]) +
-                " = " + str(output_data[i][j]))
-            if ((input_data[0][i][j] + input_data[1][i][j]) !=
-                    output_data[i][j]):
+                str(input_data[0][i][j])
+                + " + "
+                + str(input_data[1][i][j])
+                + " = "
+                + str(output_data[i][j])
+            )
+            if (input_data[0][i][j] + input_data[1][i][j]) != output_data[i][j]:
                 print("error: incorrect value")
                 sys.exit(1)
diff --git a/qa/L0_custom_ops/test.sh b/qa/L0_custom_ops/test.sh
index c4b50dd43d..a12c1d67a4 100755
--- a/qa/L0_custom_ops/test.sh
+++ b/qa/L0_custom_ops/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved.
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -57,9 +57,10 @@ RET=0
 
 # Must explicitly set LD_LIBRARY_PATH so that the custom operations
 # can find libtensorflow_framework.so.
-LD_LIBRARY_PATH=/opt/tritonserver/backends/tensorflow1:$LD_LIBRARY_PATH
+LD_LIBRARY_PATH=/opt/tritonserver/backends/tensorflow:$LD_LIBRARY_PATH
 
 # Tensorflow
+## Load operations via LD_PRELOAD
 SERVER_ARGS="--model-repository=/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/tf_custom_ops"
 SERVER_LD_PRELOAD="/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/tf_custom_ops/libzeroout.so:/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/tf_custom_ops/libcudaop.so:/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/tf_custom_ops/libbusyop.so"
 
@@ -105,13 +106,72 @@ set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
+## Load operations via model config
+SERVER_ARGS="--model-repository=tf_custom_ops"
+SERVER_LD_PRELOAD=""
+
+rm -rf tf_custom_ops && \
+    mkdir -p tf_custom_ops && \
+    cp -r /data/inferenceserver/${REPO_VERSION}/qa_custom_ops/tf_custom_ops .
+
+for MODEL_TYPE in savedmodel graphdef; do
+    echo "model_operations { op_library_filename: \"tf_custom_ops/libbusyop.so\" }" >> tf_custom_ops/${MODEL_TYPE}_busyop/config.pbtxt
+    echo "model_operations { op_library_filename: \"tf_custom_ops/libcudaop.so\" }" >> tf_custom_ops/${MODEL_TYPE}_cudaop/config.pbtxt
+    echo "model_operations { op_library_filename: \"tf_custom_ops/libzeroout.so\" }" >> tf_custom_ops/${MODEL_TYPE}_zeroout/config.pbtxt
+done
+
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+
+python $ZERO_OUT_TEST -v -m graphdef_zeroout >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+python $ZERO_OUT_TEST -v -m savedmodel_zeroout >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+python $CUDA_OP_TEST -v -m graphdef_cudaop >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+python $CUDA_OP_TEST -v -m savedmodel_cudaop >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
 # Must set LD_LIBRARY_PATH just for the server launch so that the
 # custom operations can find libtorch.so and other pytorch dependencies.
 LD_LIBRARY_PATH=/opt/tritonserver/backends/pytorch:$LD_LIBRARY_PATH
 
 # Pytorch
 SERVER_ARGS="--model-repository=/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/libtorch_custom_ops"
-SERVER_LD_PRELOAD="/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/libtorch_custom_ops/libtorch_modulo/custom_modulo.so"
+# FIXME: Pre-load the python system library to satisfy the symbol definitions,
+# as the custom op library is built with a different python version within the
+# pytorch container. See DLIS-4152.
+SERVER_LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libpython3.10.so.1:/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/libtorch_custom_ops/libtorch_modulo/custom_modulo.so"
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
diff --git a/qa/L0_custom_ops/vision_op_test.py b/qa/L0_custom_ops/vision_op_test.py
old mode 100644
new mode 100755
index c925dc19c0..88857c3d12
--- a/qa/L0_custom_ops/vision_op_test.py
+++ b/qa/L0_custom_ops/vision_op_test.py
@@ -27,46 +27,49 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import argparse
-import numpy as np
 import sys
+
+import numpy as np
 import tritonclient.grpc as grpcclient
 import tritonclient.http as httpclient
 from tritonclient.utils import np_to_triton_dtype
 
 FLAGS = None
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-v',
-                        '--verbose',
-                        action="store_true",
-                        required=False,
-                        default=False,
-                        help='Enable verbose output')
-    parser.add_argument('-u',
-                        '--url',
-                        type=str,
-                        required=False,
-                        default='localhost:8000',
-                        help='Inference server URL. Default is localhost:8000.')
     parser.add_argument(
-        '-i',
-        '--protocol',
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-u",
+        "--url",
+        type=str,
+        required=False,
+        default="localhost:8000",
+        help="Inference server URL. Default is localhost:8000.",
+    )
+    parser.add_argument(
+        "-i",
+        "--protocol",
         type=str,
         required=False,
-        default='http',
-        help='Protocol ("http"/"grpc") used to ' +
-        'communicate with inference service. Default is "http".')
-    parser.add_argument('-m',
-                        '--model',
-                        type=str,
-                        required=True,
-                        help='Name of model.')
+        default="http",
+        help='Protocol ("http"/"grpc") used to '
+        + 'communicate with inference service. Default is "http".',
+    )
+    parser.add_argument("-m", "--model", type=str, required=True, help="Name of model.")
 
     FLAGS = parser.parse_args()
     if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"):
-        print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format(
-            FLAGS.protocol))
+        print(
+            'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol)
+        )
         exit(1)
 
     client_util = httpclient if FLAGS.protocol == "http" else grpcclient
@@ -83,23 +86,26 @@
 
     inputs = []
     inputs.append(
-        client_util.InferInput("INPUT__0", input_data.shape,
-                               np_to_triton_dtype(input_data.dtype)))
+        client_util.InferInput(
+            "INPUT__0", input_data.shape, np_to_triton_dtype(input_data.dtype)
+        )
+    )
     inputs[0].set_data_from_numpy(input_data)
     inputs.append(
-        client_util.InferInput("INPUT__1", box_data.shape,
-                               np_to_triton_dtype(box_data.dtype)))
+        client_util.InferInput(
+            "INPUT__1", box_data.shape, np_to_triton_dtype(box_data.dtype)
+        )
+    )
     inputs[1].set_data_from_numpy(box_data)
 
     results = client.infer(model_name, inputs)
 
     # We expect 1 result of shape [1, 3, 5, 5].
-    output_data = results.as_numpy('OUTPUT__0')
+    output_data = results.as_numpy("OUTPUT__0")
     if output_data is None:
         print("error: expected 'OUTPUT__0'")
         sys.exit(1)
 
-    if (output_data.shape != (1, 3, 5, 5)):
-        print("error: incorrect shape " + str(output_data.shape) +
-              "for 'OUTPUT__0'")
+    if output_data.shape != (1, 3, 5, 5):
+        print("error: incorrect shape " + str(output_data.shape) + "for 'OUTPUT__0'")
         sys.exit(1)
diff --git a/qa/L0_custom_ops/zero_out_test.py b/qa/L0_custom_ops/zero_out_test.py
old mode 100644
new mode 100755
index ad87dc8f37..28d5d2c9e6
--- a/qa/L0_custom_ops/zero_out_test.py
+++ b/qa/L0_custom_ops/zero_out_test.py
@@ -27,47 +27,50 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import argparse
-import numpy as np
 import sys
 from builtins import range
+
+import numpy as np
 import tritongrpcclient as grpcclient
 import tritonhttpclient as httpclient
 from tritonclientutils import np_to_triton_dtype
 
 FLAGS = None
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-v',
-                        '--verbose',
-                        action="store_true",
-                        required=False,
-                        default=False,
-                        help='Enable verbose output')
-    parser.add_argument('-u',
-                        '--url',
-                        type=str,
-                        required=False,
-                        default='localhost:8000',
-                        help='Inference server URL. Default is localhost:8000.')
     parser.add_argument(
-        '-i',
-        '--protocol',
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-u",
+        "--url",
         type=str,
         required=False,
-        default='http',
-        help='Protocol ("http"/"grpc") used to ' +
-        'communicate with inference service. Default is "http".')
-    parser.add_argument('-m',
-                        '--model',
-                        type=str,
-                        required=True,
-                        help='Name of model.')
+        default="localhost:8000",
+        help="Inference server URL. Default is localhost:8000.",
+    )
+    parser.add_argument(
+        "-i",
+        "--protocol",
+        type=str,
+        required=False,
+        default="http",
+        help='Protocol ("http"/"grpc") used to '
+        + 'communicate with inference service. Default is "http".',
+    )
+    parser.add_argument("-m", "--model", type=str, required=True, help="Name of model.")
 
     FLAGS = parser.parse_args()
     if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"):
-        print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format(
-            FLAGS.protocol))
+        print(
+            'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol)
+        )
         exit(1)
 
     client_util = httpclient if FLAGS.protocol == "http" else grpcclient
@@ -83,8 +86,9 @@
     input_data = np.arange(start=42, stop=42 + elements, dtype=np.int32)
 
     inputs = [
-        client_util.InferInput("to_zero", input_data.shape,
-                               np_to_triton_dtype(input_data.dtype))
+        client_util.InferInput(
+            "to_zero", input_data.shape, np_to_triton_dtype(input_data.dtype)
+        )
     ]
     inputs[0].set_data_from_numpy(input_data)
     results = client.infer(model_name, inputs)
@@ -97,8 +101,8 @@
 
     for i in range(elements):
         print(
-            str(i) + ": input " + str(input_data[i]) + ", output " +
-            str(output_data[i]))
+            str(i) + ": input " + str(input_data[i]) + ", output " + str(output_data[i])
+        )
         if (i == 0) and (input_data[i] != output_data[i]):
             print("error: incorrect value")
             sys.exit(1)
diff --git a/qa/L0_data_compression/test.sh b/qa/L0_data_compression/test.sh
old mode 100644
new mode 100755
index aa8b950fe5..28255f5f7b
--- a/qa/L0_data_compression/test.sh
+++ b/qa/L0_data_compression/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -55,7 +55,7 @@ set +e
 echo "All work and no play makes Jack a dull boy" >> raw_data
 python3 validation.py generate_compressed_data
 
-$DATA_COMPRESSOR_TEST >>$TEST_LOG 2>&1
+LD_LIBRARY_PATH=/opt/tritonserver/lib:${LD_LIBRARY_PATH} $DATA_COMPRESSOR_TEST >>$TEST_LOG 2>&1
 if [ $? -ne 0 ]; then
     echo -e "\n***\n*** Data Compression Test Failed\n***"
     RET=1
@@ -148,6 +148,9 @@ if [ $? -ne 0 ]; then
 fi
 set -e
 
+kill $SERVER_PID
+wait $SERVER_PID
+
 if [ $RET -eq 0 ]; then
     echo -e "\n***\n*** Test Passed\n***"
 else
diff --git a/qa/L0_data_compression/validation.py b/qa/L0_data_compression/validation.py
old mode 100644
new mode 100755
index 927c863952..a0e5cb1576
--- a/qa/L0_data_compression/validation.py
+++ b/qa/L0_data_compression/validation.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -29,8 +31,9 @@
 
 def generate_compressed_data():
     with open("raw_data", "rb") as f:
-        import zlib
         import gzip
+        import zlib
+
         raw_data = f.read()
         with open("deflate_compressed_data", "wb") as of:
             of.write(zlib.compress(raw_data))
@@ -40,8 +43,9 @@ def generate_compressed_data():
 
 def validate_compressed_data():
     with open("raw_data", "rb") as f:
-        import zlib
         import gzip
+        import zlib
+
         raw_data = f.read()
         with open("generated_deflate_compressed_data", "rb") as cf:
             decompressed_data = zlib.decompress(cf.read())
@@ -53,5 +57,5 @@ def validate_compressed_data():
                 exit(1)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     globals()[sys.argv[1]]()
diff --git a/qa/L0_decoupled/decoupled_test.py b/qa/L0_decoupled/decoupled_test.py
old mode 100644
new mode 100755
index bb2219b6f0..b78170cf63
--- a/qa/L0_decoupled/decoupled_test.py
+++ b/qa/L0_decoupled/decoupled_test.py
@@ -1,5 +1,6 @@
-#!/bin/bash
-# Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -26,59 +27,94 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-from functools import partial
-import numpy as np
-import queue
-import unittest
 import os
+import queue
 import time
-import test_util as tu
+import unittest
+from functools import partial
 
-import tritongrpcclient as grpcclient
-import tritonhttpclient as httpclient
-from tritonclientutils import InferenceServerException
+import numpy as np
+import test_util as tu
+import tritonclient.grpc as grpcclient
+import tritonclient.http as httpclient
+from tritonclient.utils import InferenceServerException
 
 
 class UserData:
-
     def __init__(self):
-        self._completed_requests = queue.Queue()
+        self._response_queue = queue.Queue()
 
 
 def callback(user_data, result, error):
     if error:
-        user_data._completed_requests.put(error)
+        user_data._response_queue.put(error)
     else:
-        user_data._completed_requests.put(result)
+        user_data._response_queue.put(result)
 
 
 class DecoupledTest(tu.TestResultCollector):
-
     def setUp(self):
-        self.trials_ = [("repeat_int32", None), ("simple_repeat", None),
-                        ("sequence_repeat", None),
-                        ("fan_repeat", self._fan_validate),
-                        ("repeat_square", self._nested_validate),
-                        ("nested_square", self._nested_validate)]
+        self.trials_ = [
+            ("repeat_int32", None),
+            ("simple_repeat", None),
+            ("sequence_repeat", None),
+            ("fan_repeat", self._fan_validate),
+            ("repeat_square", self._nested_validate),
+            ("nested_square", self._nested_validate),
+        ]
         self.model_name_ = "repeat_int32"
 
         self.inputs_ = []
-        self.inputs_.append(grpcclient.InferInput('IN', [1], "INT32"))
-        self.inputs_.append(grpcclient.InferInput('DELAY', [1], "UINT32"))
-        self.inputs_.append(grpcclient.InferInput('WAIT', [1], "UINT32"))
+        self.inputs_.append(grpcclient.InferInput("IN", [1], "INT32"))
+        self.inputs_.append(grpcclient.InferInput("DELAY", [1], "UINT32"))
+        self.inputs_.append(grpcclient.InferInput("WAIT", [1], "UINT32"))
 
         self.outputs_ = []
-        self.outputs_.append(grpcclient.InferRequestedOutput('OUT'))
-        self.outputs_.append(grpcclient.InferRequestedOutput('IDX'))
+        self.outputs_.append(grpcclient.InferRequestedOutput("OUT"))
+        self.outputs_.append(grpcclient.InferRequestedOutput("IDX"))
         # Some trials only expect a subset of outputs
         self.requested_outputs_ = self.outputs_
 
-    def _stream_infer(self, request_count, request_delay, expected_count,
-                      delay_data, delay_factor, user_data, result_dict):
-        with grpcclient.InferenceServerClient(url="localhost:8001",
-                                              verbose=True) as triton_client:
+    # The client can receive a "triton_final_response" response parameter
+    # from the Triton server that indicates when a response is the final
+    # response for its request.
+    #
+    # For non-decoupled models, there is a 1:1 request:response ratio, so every
+    # response is the final response, and this parameter is unnecessary.
+    #
+    # For decoupled models, there is a 1:N request:response ratio, so there may be
+    # more than one response before receiving the "final" response.
+    #
+    # However, decoupled models have the unique property that they can return
+    # a flags-only response to the server to indicate completion, which is not
+    # returned to the client by default (See TRITONBACKEND_ResponseFactorySendFlags).
+    #
+    # To forward this flags-only response to the client, users must opt in to this
+    # behavior by adding the following argument:
+    # client.async_stream_infer(..., enable_empty_final_response=True).
+    #
+    # If the decoupled backend/model always sends the final response flag along
+    # with a non-null response, no opt-in is needed.
+    #
+    # With this behavior, the client can programmatically detect when all responses
+    # for an individual request have been received without knowing the expected
+    # number of responses in advance and without closing the stream.
+    def _stream_infer_with_params(
+        self,
+        request_count,
+        request_delay,
+        _,
+        delay_data,
+        delay_factor,
+        user_data,
+        result_dict,
+    ):
+        with grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        ) as triton_client:
             # Establish stream
             triton_client.start_stream(callback=partial(callback, user_data))
             # Send specified many requests in parallel
@@ -89,7 +125,67 @@ def _stream_infer(self, request_count, request_delay, expected_count,
                     model_name=self.model_name_,
                     inputs=self.inputs_,
                     request_id=str(i),
-                    outputs=self.requested_outputs_)
+                    outputs=self.requested_outputs_,
+                    # Opt-in to receiving flags-only responses from model/backend
+                    # to help detect final responses for decoupled models.
+                    enable_empty_final_response=True,
+                )
+                # Update delay input in accordance with the scaling factor
+                delay_data = delay_data * delay_factor
+                delay_data = delay_data.astype(np.uint32)
+
+            # Retrieve results...
+            recv_count = 0
+            completed_requests = 0
+            while completed_requests < request_count:
+                data_item = user_data._response_queue.get()
+                if type(data_item) == InferenceServerException:
+                    raise data_item
+                else:
+                    response = data_item.get_response()
+                    # Request IDs should generally be provided with each request
+                    # to associate decoupled responses with their requests.
+                    if not response.id:
+                        raise ValueError(
+                            "No response id found. Was a request_id provided?"
+                        )
+
+                    # Detect final response. Parameters are oneof and we expect bool_param
+                    if response.parameters.get("triton_final_response").bool_param:
+                        completed_requests += 1
+
+                    # Only process non-empty response, ignore if empty (no outputs)
+                    if response.outputs:
+                        if response.id not in result_dict:
+                            result_dict[response.id] = []
+                        result_dict[response.id].append((recv_count, data_item))
+                        recv_count += 1
+
+    def _stream_infer(
+        self,
+        request_count,
+        request_delay,
+        expected_count,
+        delay_data,
+        delay_factor,
+        user_data,
+        result_dict,
+    ):
+        with grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        ) as triton_client:
+            # Establish stream
+            triton_client.start_stream(callback=partial(callback, user_data))
+            # Send specified many requests in parallel
+            for i in range(request_count):
+                time.sleep((request_delay / 1000))
+                self.inputs_[1].set_data_from_numpy(delay_data)
+                triton_client.async_stream_infer(
+                    model_name=self.model_name_,
+                    inputs=self.inputs_,
+                    request_id=str(i),
+                    outputs=self.requested_outputs_,
+                )
                 # Update delay input in accordance with the scaling factor
                 delay_data = delay_data * delay_factor
                 delay_data = delay_data.astype(np.uint32)
@@ -97,12 +193,12 @@ def _stream_infer(self, request_count, request_delay, expected_count,
             # Retrieve results...
             recv_count = 0
             while recv_count < expected_count:
-                data_item = user_data._completed_requests.get()
+                data_item = user_data._response_queue.get()
                 if type(data_item) == InferenceServerException:
                     raise data_item
                 else:
                     this_id = data_item.get_response().id
-                    if this_id not in result_dict.keys():
+                    if this_id not in result_dict:
                         result_dict[this_id] = []
                     result_dict[this_id].append((recv_count, data_item))
 
@@ -113,7 +209,7 @@ def _fan_validate(self, result_list, data_offset, repeat_count):
         self.assertEqual(len(result_list), repeat_count)
         expected_data = 2 * data_offset
         for j in range(len(result_list)):
-            this_data = result_list[j][1].as_numpy('OUT')
+            this_data = result_list[j][1].as_numpy("OUT")
             self.assertEqual(len(this_data), 1)
             self.assertEqual(this_data[0], expected_data)
             expected_data += 2
@@ -121,13 +217,12 @@ def _fan_validate(self, result_list, data_offset, repeat_count):
     def _nested_validate(self, result_list, data_offset, repeat_count):
         # if repeat model returns repeat result n, repeat_square-like model
         # will return the same result n times
-        expected_len = sum(
-            x for x in range(data_offset, data_offset + repeat_count))
+        expected_len = sum(x for x in range(data_offset, data_offset + repeat_count))
         self.assertEqual(len(result_list), expected_len)
         expected_data = data_offset
         expected_count = expected_data
         for j in range(len(result_list)):
-            this_data = result_list[j][1].as_numpy('OUT')
+            this_data = result_list[j][1].as_numpy("OUT")
             self.assertEqual(len(this_data), 1)
             self.assertEqual(this_data[0], expected_data)
             expected_count -= 1
@@ -135,20 +230,22 @@ def _nested_validate(self, result_list, data_offset, repeat_count):
                 expected_data += 1
                 expected_count = expected_data
 
-    def _decoupled_infer(self,
-                         request_count,
-                         request_delay=0,
-                         repeat_count=1,
-                         data_offset=100,
-                         delay_time=1000,
-                         delay_factor=1,
-                         wait_time=500,
-                         order_sequence=None,
-                         validate_fn=None):
+    def _decoupled_infer(
+        self,
+        request_count,
+        request_delay=0,
+        repeat_count=1,
+        data_offset=100,
+        delay_time=1000,
+        delay_factor=1,
+        wait_time=500,
+        order_sequence=None,
+        validate_fn=None,
+    ):
         # Initialize data for IN
-        input_data = np.arange(start=data_offset,
-                               stop=data_offset + repeat_count,
-                               dtype=np.int32)
+        input_data = np.arange(
+            start=data_offset, stop=data_offset + repeat_count, dtype=np.int32
+        )
         self.inputs_[0].set_shape([repeat_count])
         self.inputs_[0].set_data_from_numpy(input_data)
 
@@ -161,54 +258,67 @@ def _decoupled_infer(self,
         self.inputs_[2].set_data_from_numpy(wait_data)
 
         # use validate_fn to differentiate requested outputs
-        self.requested_outputs_ = self.outputs_ if validate_fn is None else self.outputs_[
-            0:1]
+        self.requested_outputs_ = (
+            self.outputs_ if validate_fn is None else self.outputs_[0:1]
+        )
 
-        user_data = UserData()
-        result_dict = {}
+        for infer_helper in [self._stream_infer, self._stream_infer_with_params]:
+            user_data = UserData()
+            result_dict = {}
 
-        try:
-            if "square" not in self.model_name_:
-                expected_count = (repeat_count * request_count)
-            else:
-                expected_count = sum(
-                    x for x in range(data_offset, data_offset +
-                                     repeat_count)) * request_count
-            self._stream_infer(request_count, request_delay, expected_count,
-                               delay_data, delay_factor, user_data, result_dict)
-        except Exception as ex:
-            self.assertTrue(False, "unexpected error {}".format(ex))
-
-        # Validate the results..
-        for i in range(request_count):
-            this_id = str(i)
-            if repeat_count != 0 and this_id not in result_dict.keys():
-                self.assertTrue(
-                    False,
-                    "response for request id {} not received".format(this_id))
-            elif repeat_count == 0 and this_id in result_dict.keys():
-                self.assertTrue(
-                    False,
-                    "received unexpected response for request id {}".format(
-                        this_id))
-            if repeat_count != 0:
-                if validate_fn is None:
-                    self.assertEqual(len(result_dict[this_id]), repeat_count)
-                    expected_data = data_offset
-                    result_list = result_dict[this_id]
-                    for j in range(len(result_list)):
-                        if order_sequence is not None:
-                            self.assertEqual(result_list[j][0],
-                                             order_sequence[i][j])
-                        this_data = result_list[j][1].as_numpy('OUT')
-                        self.assertEqual(len(this_data), 1)
-                        self.assertEqual(this_data[0], expected_data)
-                        this_idx = result_list[j][1].as_numpy('IDX')
-                        self.assertEqual(len(this_idx), 1)
-                        self.assertEqual(this_idx[0], j)
-                        expected_data += 1
+            try:
+                if "square" not in self.model_name_:
+                    expected_count = repeat_count * request_count
                 else:
-                    validate_fn(result_dict[this_id], data_offset, repeat_count)
+                    expected_count = (
+                        sum(x for x in range(data_offset, data_offset + repeat_count))
+                        * request_count
+                    )
+                infer_helper(
+                    request_count,
+                    request_delay,
+                    expected_count,
+                    delay_data,
+                    delay_factor,
+                    user_data,
+                    result_dict,
+                )
+            except Exception as ex:
+                self.assertTrue(False, "unexpected error {}".format(ex))
+
+            # Validate the results.
+            for i in range(request_count):
+                this_id = str(i)
+                if repeat_count != 0 and this_id not in result_dict.keys():
+                    self.assertTrue(
+                        False, "response for request id {} not received".format(this_id)
+                    )
+                elif repeat_count == 0 and this_id in result_dict.keys():
+                    self.assertTrue(
+                        False,
+                        "received unexpected response for request id {}".format(
+                            this_id
+                        ),
+                    )
+                if repeat_count != 0:
+                    if validate_fn is None:
+                        self.assertEqual(len(result_dict[this_id]), repeat_count)
+                        expected_data = data_offset
+                        result_list = result_dict[this_id]
+                        for j in range(len(result_list)):
+                            if order_sequence is not None:
+                                self.assertEqual(
+                                    result_list[j][0], order_sequence[i][j]
+                                )
+                            this_data = result_list[j][1].as_numpy("OUT")
+                            self.assertEqual(len(this_data), 1)
+                            self.assertEqual(this_data[0], expected_data)
+                            this_idx = result_list[j][1].as_numpy("IDX")
+                            self.assertEqual(len(this_idx), 1)
+                            self.assertEqual(this_idx[0], j)
+                            expected_data += 1
+                    else:
+                        validate_fn(result_dict[this_id], data_offset, repeat_count)
 
     def test_one_to_none(self):
         # Test cases where each request generates no response.
@@ -218,13 +328,9 @@ def test_one_to_none(self):
         for trial in self.trials_:
             self.model_name_ = trial[0]
             # Single request case
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=0,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(request_count=1, repeat_count=0, validate_fn=trial[1])
             # Multiple request case
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=0,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(request_count=5, repeat_count=0, validate_fn=trial[1])
 
     def test_one_to_one(self):
         # Test cases where each request generates a single response.
@@ -235,23 +341,15 @@ def test_one_to_one(self):
             self.model_name_ = trial[0]
             # Single request case
             # Release request before the response is delivered
-            self._decoupled_infer(request_count=1,
-                                  wait_time=500,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(request_count=1, wait_time=500, validate_fn=trial[1])
             # Release request after the response is delivered
-            self._decoupled_infer(request_count=1,
-                                  wait_time=2000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(request_count=1, wait_time=2000, validate_fn=trial[1])
 
             # Multiple request case
             # Release request before the response is delivered
-            self._decoupled_infer(request_count=5,
-                                  wait_time=500,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(request_count=5, wait_time=500, validate_fn=trial[1])
             # Release request after the response is delivered
-            self._decoupled_infer(request_count=5,
-                                  wait_time=2000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(request_count=5, wait_time=2000, validate_fn=trial[1])
 
     def test_one_to_many(self):
         # Test cases where each request generates multiple responses.
@@ -264,37 +362,31 @@ def test_one_to_many(self):
             self.model_name_ = trial[0]
             # Single request case
             # Release request before the first response is delivered
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=5,
-                                  wait_time=500,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=1, repeat_count=5, wait_time=500, validate_fn=trial[1]
+            )
             # Release request when the responses are getting delivered
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=5,
-                                  wait_time=2000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=1, repeat_count=5, wait_time=2000, validate_fn=trial[1]
+            )
             # Release request after all the responses are delivered
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=5,
-                                  wait_time=10000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=1, repeat_count=5, wait_time=10000, validate_fn=trial[1]
+            )
 
             # Multiple request case
             # Release request before the first response is delivered
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=5,
-                                  wait_time=500,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=5, repeat_count=5, wait_time=500, validate_fn=trial[1]
+            )
             # Release request when the responses are getting delivered
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=5,
-                                  wait_time=2000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=5, repeat_count=5, wait_time=2000, validate_fn=trial[1]
+            )
             # Release request after all the responses are delivered
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=5,
-                                  wait_time=10000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=5, repeat_count=5, wait_time=10000, validate_fn=trial[1]
+            )
 
     def test_one_to_multi_many(self):
         # Test cases where each request generates multiple responses but the
@@ -307,37 +399,31 @@ def test_one_to_multi_many(self):
             self.model_name_ = trial[0]
             # Single request case
             # Release request before the first response is delivered
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=5,
-                                  wait_time=500,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=1, repeat_count=5, wait_time=500, validate_fn=trial[1]
+            )
             # Release request when the responses are getting delivered
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=5,
-                                  wait_time=8000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=1, repeat_count=5, wait_time=8000, validate_fn=trial[1]
+            )
             # Release request after all the responses are delivered
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=5,
-                                  wait_time=20000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=1, repeat_count=5, wait_time=20000, validate_fn=trial[1]
+            )
 
             # Multiple request case
             # Release request before the first response is delivered
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=5,
-                                  wait_time=500,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=5, repeat_count=5, wait_time=500, validate_fn=trial[1]
+            )
             # Release request when the responses are getting delivered
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=5,
-                                  wait_time=3000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=5, repeat_count=5, wait_time=3000, validate_fn=trial[1]
+            )
             # Release request after all the responses are delivered
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=5,
-                                  wait_time=10000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=5, repeat_count=5, wait_time=10000, validate_fn=trial[1]
+            )
 
     def test_response_order(self):
         # Test the expected response order for different cases
@@ -348,51 +434,61 @@ def test_response_order(self):
             self.model_name_ = trial[0]
 
             # Case 1: Interleaved responses
-            self._decoupled_infer(request_count=2,
-                                  request_delay=500,
-                                  repeat_count=4,
-                                  order_sequence=[[0, 2, 4, 6], [1, 3, 5, 7]],
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=2,
+                request_delay=500,
+                repeat_count=4,
+                order_sequence=[[0, 2, 4, 6], [1, 3, 5, 7]],
+                validate_fn=trial[1],
+            )
 
             # Case 2: All responses of the second request are delivered before any
             # response from the first
-            self._decoupled_infer(request_count=2,
-                                  request_delay=500,
-                                  repeat_count=4,
-                                  delay_time=2000,
-                                  delay_factor=0.1,
-                                  order_sequence=[[4, 5, 6, 7], [0, 1, 2, 3]],
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=2,
+                request_delay=500,
+                repeat_count=4,
+                delay_time=2000,
+                delay_factor=0.1,
+                order_sequence=[[4, 5, 6, 7], [0, 1, 2, 3]],
+                validate_fn=trial[1],
+            )
 
             # Case 3: Similar to Case 2, but the second request is generated
             # after the first response from the first request is received
-            self._decoupled_infer(request_count=2,
-                                  request_delay=2500,
-                                  repeat_count=4,
-                                  delay_time=2000,
-                                  delay_factor=0.1,
-                                  order_sequence=[[0, 5, 6, 7], [1, 2, 3, 4]],
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=2,
+                request_delay=2500,
+                repeat_count=4,
+                delay_time=2000,
+                delay_factor=0.1,
+                order_sequence=[[0, 5, 6, 7], [1, 2, 3, 4]],
+                validate_fn=trial[1],
+            )
 
             # Case 4: All the responses of the second request are delivered after
             # all the responses from the first request are received
-            self._decoupled_infer(request_count=2,
-                                  request_delay=100,
-                                  repeat_count=4,
-                                  delay_time=500,
-                                  delay_factor=10,
-                                  order_sequence=[[0, 1, 2, 3], [4, 5, 6, 7]],
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=2,
+                request_delay=100,
+                repeat_count=4,
+                delay_time=500,
+                delay_factor=10,
+                order_sequence=[[0, 1, 2, 3], [4, 5, 6, 7]],
+                validate_fn=trial[1],
+            )
 
             # Case 5: Similar to Case 4, but the second request is generated
             # after the first response from the first request is received
-            self._decoupled_infer(request_count=2,
-                                  request_delay=750,
-                                  repeat_count=4,
-                                  delay_time=500,
-                                  delay_factor=10,
-                                  order_sequence=[[0, 1, 2, 3], [4, 5, 6, 7]],
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=2,
+                request_delay=750,
+                repeat_count=4,
+                delay_time=500,
+                delay_factor=10,
+                order_sequence=[[0, 1, 2, 3], [4, 5, 6, 7]],
+                validate_fn=trial[1],
+            )
 
     def _no_streaming_helper(self, protocol):
         data_offset = 100
@@ -400,9 +496,9 @@ def _no_streaming_helper(self, protocol):
         delay_time = 1000
         wait_time = 2000
 
-        input_data = np.arange(start=data_offset,
-                               stop=data_offset + repeat_count,
-                               dtype=np.int32)
+        input_data = np.arange(
+            start=data_offset, stop=data_offset + repeat_count, dtype=np.int32
+        )
         delay_data = (np.ones([repeat_count], dtype=np.uint32)) * delay_time
         wait_data = np.array([wait_time], dtype=np.uint32)
 
@@ -412,12 +508,11 @@ def _no_streaming_helper(self, protocol):
             this_outputs = self.outputs_
         else:
             this_inputs = []
-            this_inputs.append(
-                httpclient.InferInput('IN', [repeat_count], "INT32"))
-            this_inputs.append(httpclient.InferInput('DELAY', [1], "UINT32"))
-            this_inputs.append(httpclient.InferInput('WAIT', [1], "UINT32"))
+            this_inputs.append(httpclient.InferInput("IN", [repeat_count], "INT32"))
+            this_inputs.append(httpclient.InferInput("DELAY", [1], "UINT32"))
+            this_inputs.append(httpclient.InferInput("WAIT", [1], "UINT32"))
             this_outputs = []
-            this_outputs.append(httpclient.InferRequestedOutput('OUT'))
+            this_outputs.append(httpclient.InferRequestedOutput("OUT"))
 
         # Initialize data for IN
         this_inputs[0].set_shape([repeat_count])
@@ -432,19 +527,22 @@ def _no_streaming_helper(self, protocol):
 
         if protocol == "grpc":
             triton_client = grpcclient.InferenceServerClient(
-                url="localhost:8001", verbose=True)
+                url="localhost:8001", verbose=True
+            )
         else:
             triton_client = httpclient.InferenceServerClient(
-                url="localhost:8000", verbose=True)
+                url="localhost:8000", verbose=True
+            )
 
         with self.assertRaises(InferenceServerException) as cm:
-            triton_client.infer(model_name=self.model_name_,
-                                inputs=this_inputs,
-                                outputs=this_outputs)
+            triton_client.infer(
+                model_name=self.model_name_, inputs=this_inputs, outputs=this_outputs
+            )
 
         self.assertIn(
             "doesn't support models with decoupled transaction policy",
-            str(cm.exception))
+            str(cm.exception),
+        )
 
     def test_no_streaming(self):
         # Test cases with no streaming inference. Server should give
@@ -463,9 +561,9 @@ def test_wrong_shape(self):
         delay_time = 1000
         wait_time = 2000
 
-        input_data = np.arange(start=data_offset,
-                               stop=data_offset + repeat_count,
-                               dtype=np.int32)
+        input_data = np.arange(
+            start=data_offset, stop=data_offset + repeat_count, dtype=np.int32
+        )
         delay_data = (np.ones([repeat_count + 1], dtype=np.uint32)) * delay_time
         wait_data = np.array([wait_time], dtype=np.uint32)
 
@@ -484,12 +582,14 @@ def test_wrong_shape(self):
         result_dict = {}
 
         with self.assertRaises(InferenceServerException) as cm:
-            self._stream_infer(1, 0, repeat_count, delay_data, 1, user_data,
-                               result_dict)
+            self._stream_infer(
+                1, 0, repeat_count, delay_data, 1, user_data, result_dict
+            )
 
-        self.assertIn("expected IN and DELAY shape to match, got [1] and [2]",
-                      str(cm.exception))
+        self.assertIn(
+            "expected IN and DELAY shape to match, got [1] and [2]", str(cm.exception)
+        )
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
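
The streaming helpers above follow a simple callback-and-queue idiom: a UserData holder wraps a queue, the gRPC stream callback pushes every result (or error) into it, and the test drains the queue until the expected number of responses has arrived. The sketch below is a minimal, stand-alone illustration of that idiom; the UserData and callback definitions, the model name, and the guarded main are assumptions modeled on how they are used in decoupled_test.py rather than the file's actual definitions.

# Minimal sketch of the callback/queue idiom used by the streaming helpers above.
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException


class UserData:
    def __init__(self):
        # Responses (or errors) are pushed here by the stream callback.
        self._response_queue = queue.Queue()


def callback(user_data, result, error):
    # Invoked by the gRPC stream for every response; keep errors as-is so the
    # consumer can re-raise them.
    user_data._response_queue.put(error if error is not None else result)


def stream_once(model_name, input_array, expected_count):
    user_data = UserData()
    results = []
    with grpcclient.InferenceServerClient(url="localhost:8001") as client:
        client.start_stream(callback=partial(callback, user_data))
        infer_input = grpcclient.InferInput("IN", list(input_array.shape), "INT32")
        infer_input.set_data_from_numpy(input_array)
        client.async_stream_infer(
            model_name=model_name, inputs=[infer_input], request_id="0"
        )
        # Drain the queue until the expected number of responses has arrived.
        while len(results) < expected_count:
            item = user_data._response_queue.get()
            if isinstance(item, InferenceServerException):
                raise item
            results.append(item)
    return results


if __name__ == "__main__":
    # "repeat_int32" is a placeholder for a decoupled model served at localhost:8001.
    print(len(stream_once("repeat_int32", np.arange(100, 102, dtype=np.int32), 2)))

Re-raising errors on the consumer side keeps a broken stream from stalling the test silently; it surfaces as an exception instead.
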
diff --git a/qa/L0_decoupled/test.sh b/qa/L0_decoupled/test.sh
old mode 100644
new mode 100755
index 8fb5841997..90bb913b6c
--- a/qa/L0_decoupled/test.sh
+++ b/qa/L0_decoupled/test.sh
@@ -74,7 +74,7 @@ for trial in $TRIALS; do
       cat $SERVER_LOG
       exit 1
   fi
-  
+
   for i in \
               test_one_to_none \
               test_one_to_one \
@@ -82,7 +82,7 @@ for trial in $TRIALS; do
               test_no_streaming \
               test_response_order \
 	      test_wrong_shape; do
-  
+
       echo "Test: $i" >>$CLIENT_LOG
       set +e
       python $DECOUPLED_TEST DecoupledTest.$i >>$CLIENT_LOG 2>&1
@@ -100,11 +100,11 @@ for trial in $TRIALS; do
       fi
       set -e
   done
-  
+
   # Will delay the writing of each response by the specified number of milliseconds.
   # This will ensure that there are multiple responses available to be written.
   export TRITONSERVER_DELAY_GRPC_RESPONSE=2000
-  
+
   echo "Test: test_one_to_multi_many" >>$CLIENT_LOG
   set +e
   python $DECOUPLED_TEST DecoupledTest.test_one_to_multi_many >>$CLIENT_LOG 2>&1
@@ -120,18 +120,18 @@ for trial in $TRIALS; do
           RET=1
       fi
   fi
-  
+
   set -e
-  
+
   unset TRITONSERVER_DELAY_GRPC_RESPONSE
-  
+
   kill $SERVER_PID
   wait $SERVER_PID
 done
 
 if [ $RET -eq 0 ]; then
   echo -e "\n***\n*** Test Passed\n***"
-else 
+else
   echo -e "\n***\n*** Test Failed\n***"
 fi
 
diff --git a/qa/L0_device_memory_tracker/test.py b/qa/L0_device_memory_tracker/test.py
new file mode 100755
index 0000000000..1d443d1032
--- /dev/null
+++ b/qa/L0_device_memory_tracker/test.py
@@ -0,0 +1,109 @@
+#!/usr/bin/env python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import time
+import unittest
+from functools import partial
+
+import nvidia_smi
+import tritonclient.grpc as grpcclient
+import tritonclient.http as httpclient
+
+
+class UnifiedClientProxy:
+    def __init__(self, client):
+        self.client_ = client
+
+    def __getattr__(self, attr):
+        forward_attr = getattr(self.client_, attr)
+        if type(self.client_) == grpcclient.InferenceServerClient:
+            if attr == "get_model_config":
+                return lambda *args, **kwargs: forward_attr(
+                    *args, **kwargs, as_json=True
+                )["config"]
+            elif attr == "get_inference_statistics":
+                return partial(forward_attr, as_json=True)
+        return forward_attr
+
+
+class MemoryUsageTest(unittest.TestCase):
+    def setUp(self):
+        nvidia_smi.nvmlInit()
+        self.gpu_handle_ = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
+        self.http_client_ = httpclient.InferenceServerClient(url="localhost:8000")
+        self.grpc_client_ = grpcclient.InferenceServerClient(url="localhost:8001")
+
+    def tearDown(self):
+        nvidia_smi.nvmlShutdown()
+
+    def report_used_gpu_memory(self):
+        info = nvidia_smi.nvmlDeviceGetMemoryInfo(self.gpu_handle_)
+        return info.used
+
+    def is_testing_backend(self, model_name, backend_name):
+        return self.client_.get_model_config(model_name)["backend"] == backend_name
+
+    def verify_recorded_usage(self, model_stat):
+        recorded_gpu_usage = 0
+        for usage in model_stat["memory_usage"]:
+            if usage["type"] == "GPU":
+                recorded_gpu_usage += int(usage["byte_size"])
+        # unload and verify recorded usage
+        before_total_usage = self.report_used_gpu_memory()
+        self.client_.unload_model(model_stat["name"])
+        # unload can return before the model is fully unloaded,
+        # so wait for the unload to finish
+        time.sleep(2)
+        usage_delta = before_total_usage - self.report_used_gpu_memory()
+        # check with tolerance as the GPU usage obtained is the overall device usage
+        self.assertTrue(
+            usage_delta * 0.9 <= recorded_gpu_usage <= usage_delta * 1.1,
+            msg="For model {}, expect recorded usage to be in range [{}, {}], got {}".format(
+                model_stat["name"],
+                usage_delta * 0.9,
+                usage_delta * 1.1,
+                recorded_gpu_usage,
+            ),
+        )
+
+    def test_onnx_http(self):
+        self.client_ = UnifiedClientProxy(self.http_client_)
+        model_stats = self.client_.get_inference_statistics()["model_stats"]
+        for model_stat in model_stats:
+            if self.is_testing_backend(model_stat["name"], "onnxruntime"):
+                self.verify_recorded_usage(model_stat)
+
+    def test_plan_grpc(self):
+        self.client_ = UnifiedClientProxy(self.grpc_client_)
+        model_stats = self.client_.get_inference_statistics()["model_stats"]
+        for model_stat in model_stats:
+            if self.is_testing_backend(model_stat["name"], "tensorrt"):
+                self.verify_recorded_usage(model_stat)
+
+
+if __name__ == "__main__":
+    unittest.main()
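
verify_recorded_usage above cross-checks Triton's per-model memory_usage statistics against the device-wide usage reported through NVML, allowing a ±10% tolerance because NVML only exposes overall GPU usage. Below is a minimal sketch of that NVML query and tolerance check using the same nvidia_smi bindings the accompanying test.sh installs; the byte counts in the usage example are made up for illustration.

# Minimal sketch of the NVML query and tolerance check used above.
import nvidia_smi


def used_gpu_memory(device_index=0):
    nvidia_smi.nvmlInit()
    try:
        handle = nvidia_smi.nvmlDeviceGetHandleByIndex(device_index)
        return nvidia_smi.nvmlDeviceGetMemoryInfo(handle).used
    finally:
        nvidia_smi.nvmlShutdown()


def within_tolerance(recorded_bytes, observed_delta, tolerance=0.1):
    # Accept the recorded usage if it falls within +/- tolerance of the observed
    # device-wide change, mirroring verify_recorded_usage above.
    lower = observed_delta * (1.0 - tolerance)
    upper = observed_delta * (1.0 + tolerance)
    return lower <= recorded_bytes <= upper


# Example: unloading a model freed ~512 MiB while statistics recorded 500 MiB.
print(within_tolerance(500 * 1024**2, 512 * 1024**2))  # True
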
diff --git a/qa/L0_device_memory_tracker/test.sh b/qa/L0_device_memory_tracker/test.sh
new file mode 100755
index 0000000000..7eb0d745da
--- /dev/null
+++ b/qa/L0_device_memory_tracker/test.sh
@@ -0,0 +1,128 @@
+#!/bin/bash
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
+export CUDA_VISIBLE_DEVICES=0
+
+TEST_LOG="./test.log"
+TEST_PY=test.py
+
+DATADIR=/data/inferenceserver/${REPO_VERSION}
+rm -f *.log
+
+TEST_RESULT_FILE='test_results.txt'
+SERVER=/opt/tritonserver/bin/tritonserver
+SERVER_LOG="./server.log"
+
+source ../common/util.sh
+
+RET=0
+
+# Prepare the model repository; it only contains ONNX and TRT models as the
+# corresponding backends are known to support device memory tracking.
+rm -rf models && mkdir models
+# ONNX
+cp -r /data/inferenceserver/${REPO_VERSION}/onnx_model_store/* models/.
+rm -r models/*cpu
+
+# Convert the Caffe models to TRT PLAN models built against the current system
+CAFFE2PLAN=../common/caffe2plan
+set +e
+mkdir -p models/vgg19_plan/1 && rm -f models/vgg19_plan/1/model.plan && \
+    $CAFFE2PLAN -b32 -n prob -o models/vgg19_plan/1/model.plan \
+                $DATADIR/caffe_models/vgg19.prototxt $DATADIR/caffe_models/vgg19.caffemodel
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Failed to generate vgg19 PLAN\n***"
+    exit 1
+fi
+
+mkdir -p models/resnet50_plan/1 && rm -f models/resnet50_plan/1/model.plan && \
+    $CAFFE2PLAN -b32 -n prob -o models/resnet50_plan/1/model.plan \
+                $DATADIR/caffe_models/resnet50.prototxt $DATADIR/caffe_models/resnet50.caffemodel
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Failed to generate resnet50 PLAN\n***"
+    exit 1
+fi
+
+mkdir -p models/resnet152_plan/1 && rm -f models/resnet152_plan/1/model.plan && \
+    $CAFFE2PLAN -h -b32 -n prob -o models/resnet152_plan/1/model.plan \
+                $DATADIR/caffe_models/resnet152.prototxt $DATADIR/caffe_models/resnet152.caffemodel
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Failed to generate resnet152 PLAN\n***"
+    exit 1
+fi
+set -e
+
+# Set multiple instances on selected models to test instance-wise collection
+# and accumulation.
+echo "instance_group [{ count: 2; kind: KIND_GPU }]" >> models/resnet152_plan/config.pbtxt
+echo "instance_group [{ count: 2; kind: KIND_GPU }]" >> models/densenet/config.pbtxt
+
+# The test uses the nvidia-ml-py3 (nvidia_smi) Python bindings to validate the reported usage
+pip install nvidia-ml-py3
+
+# Start the server to load all models (in parallel), then gradually unload
+# the models and expect the memory usage changes to match what is reported
+# in the statistics.
+SERVER_ARGS="--backend-config=triton-backend-memory-tracker=true --model-repository=models --model-control-mode=explicit --load-model=*"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $TEST_PY > $TEST_LOG 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+set -e
+kill $SERVER_PID
+wait $SERVER_PID
+
+if [ $RET -eq 0 ]; then
+    echo -e "\n***\n*** Test Passed\n***"
+else
+    cat $SERVER_LOG
+    cat $TEST_LOG
+    echo -e "\n***\n*** Test FAILED\n***"
+fi
+
+exit $RET
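
The memory_usage entries consumed by test.py come straight from the model statistics response; with the HTTP client that response is a plain dictionary, so summing the recorded GPU bytes for a single model looks roughly like the sketch below. The field names mirror how verify_recorded_usage reads them, and the helper name is hypothetical.

# Rough sketch of extracting a model's recorded GPU usage from the statistics response.
import tritonclient.http as httpclient


def recorded_gpu_bytes(model_name, url="localhost:8000"):
    client = httpclient.InferenceServerClient(url=url)
    model_stats = client.get_inference_statistics()["model_stats"]
    total = 0
    for model_stat in model_stats:
        if model_stat["name"] != model_name:
            continue
        # Sum only the GPU entries, matching verify_recorded_usage above.
        for usage in model_stat.get("memory_usage", []):
            if usage["type"] == "GPU":
                total += int(usage["byte_size"])
    return total
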
diff --git a/qa/L0_backend_python/unittest/test.sh b/qa/L0_dlpack_multi_gpu/test.sh
old mode 100644
new mode 100755
similarity index 79%
rename from qa/L0_backend_python/unittest/test.sh
rename to qa/L0_dlpack_multi_gpu/test.sh
index e78b2613b0..2485bfdb88
--- a/qa/L0_backend_python/unittest/test.sh
+++ b/qa/L0_dlpack_multi_gpu/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -27,27 +27,33 @@
 
 SERVER=/opt/tritonserver/bin/tritonserver
 SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1"
-CLIENT_PY=../python_unittest.py
+CLIENT_PY=./python_unittest.py
 CLIENT_LOG="./client.log"
 EXPECTED_NUM_TESTS="1"
 TEST_RESULT_FILE='test_results.txt'
 SERVER_LOG="./inference_server.log"
+export CUDA_VISIBLE_DEVICES=0,1,2,3
 
 RET=0
 rm -fr *.log ./models
 
-source ../../common/util.sh
+source ../common/util.sh
 
 # Uninstall the non CUDA version of PyTorch
 pip3 uninstall -y torch
-pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
+pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
 pip3 install tensorflow
 
+# Install CuPy for testing non_blocking compute streams
+pip3 install cupy-cuda12x
+
 rm -fr *.log ./models
 
 mkdir -p models/dlpack_test/1/
-cp ../../python_models/dlpack_test/model.py models/dlpack_test/1/
-cp ../../python_models/dlpack_test/config.pbtxt models/dlpack_test
+cp ../python_models/dlpack_test/model.py models/dlpack_test/1/
+cp ../python_models/dlpack_test/config.pbtxt models/dlpack_test
+cp ../L0_backend_python/python_unittest.py .
+sed -i 's#sys.path.append("../../common")#sys.path.append("../common")#g' python_unittest.py
 
 run_server
 if [ "$SERVER_PID" == "0" ]; then
@@ -58,7 +64,7 @@ fi
 
 set +e
 export MODEL_NAME="dlpack_test"
-python3 $CLIENT_PY > $CLIENT_LOG 2>&1 
+python3 $CLIENT_PY > $CLIENT_LOG 2>&1
 
 if [ $? -ne 0 ]; then
     echo -e "\n***\n*** python_unittest.py FAILED. \n***"
@@ -84,4 +90,5 @@ else
     echo -e "\n***\n*** Unittest test PASSED. \n***"
 fi
 
+export CUDA_VISIBLE_DEVICES=0
 exit $RET
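
The dlpack_test model exercised by this script passes GPU tensors between frameworks through DLPack, which is why both a CUDA build of PyTorch and CuPy are installed. The snippet below is a rough, Triton-independent illustration of that zero-copy handoff, assuming a CUDA-capable environment; it is not the dlpack_test model itself.

# Rough illustration of DLPack interchange between PyTorch and CuPy; the memory
# is shared, not copied, so changes on one side are visible to the other.
import cupy as cp
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

# Create a tensor on GPU 0 and hand it to CuPy without copying.
torch_tensor = torch.arange(4, dtype=torch.float32, device="cuda:0")
cupy_array = cp.from_dlpack(to_dlpack(torch_tensor))

# Mutating through CuPy is reflected in the original PyTorch tensor.
cupy_array *= 2
assert torch.equal(
    torch_tensor, torch.tensor([0.0, 2.0, 4.0, 6.0], device="cuda:0")
)

# The reverse direction works the same way via a DLPack capsule.
back_to_torch = from_dlpack(cupy_array.toDlpack())
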
diff --git a/qa/L0_doc_links/mkdocs.yml b/qa/L0_doc_links/mkdocs.yml
new file mode 100644
index 0000000000..1588680d92
--- /dev/null
+++ b/qa/L0_doc_links/mkdocs.yml
@@ -0,0 +1,44 @@
+# Copyright (c) 2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+site_name: CI Test
+use_directory_urls: False
+docs_dir: "./repos"
+plugins:
+        - htmlproofer
+        - search
diff --git a/qa/L0_doc_links/test.sh b/qa/L0_doc_links/test.sh
new file mode 100755
index 0000000000..be7d291b01
--- /dev/null
+++ b/qa/L0_doc_links/test.sh
@@ -0,0 +1,76 @@
+#!/bin/bash
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+LOG="`pwd`/log.txt"
+CONFIG="`pwd`/mkdocs.yml"
+RET=0
+# Download necessary packages
+python3 -m pip install mkdocs
+python3 -m pip install mkdocs-htmlproofer-plugin
+
+# Get the necessary repos
+mkdir repos && cd repos
+TRITON_BACKEND_REPO_TAG=${TRITON_BACKEND_REPO_TAG:="main"}
+echo ${TRITON_BACKEND_REPO_TAG}
+git clone --single-branch --depth=1 -b ${TRITON_BACKEND_REPO_TAG} https://github.com/triton-inference-server/backend.git
+cd ..
+
+exec mkdocs serve -f $CONFIG > $LOG &
+PID=$!
+# Time for the compilation to finish. This needs to be increased if other repos
+# are added to the test
+sleep 20
+
+until [[ (-z `pgrep mkdocs`) ]]; do
+    kill -2 $PID
+    sleep 2
+done
+
+if [[ ! -z `grep "invalid url" $LOG` ]]; then
+    cat $LOG
+    RET=1
+fi
+
+
+if [ $RET -eq 0 ]; then
+    echo -e "\n***\n*** Test PASSED\n***"
+else
+    echo -e "\n***\n*** Test FAILED\n***"
+fi
+# exit $RET
diff --git a/qa/L0_dyna_implicit_state/test.sh b/qa/L0_dyna_implicit_state/test.sh
old mode 100644
new mode 100755
index e09a24d493..0721d5cd32
--- a/qa/L0_dyna_implicit_state/test.sh
+++ b/qa/L0_dyna_implicit_state/test.sh
@@ -25,12 +25,25 @@
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
 export ENSEMBLES=0
 BACKENDS=${BACKENDS:="onnx plan"}
 export BACKENDS
 export IMPLICIT_STATE=1
 
-(cd ../L0_dyna_sequence_batcher/ && bash -ex test.sh)
+(cd ../L0_dyna_sequence_batcher/ && bash -ex test.sh $REPO_VERSION)
 RET=$?
 
 if [ $RET == 0 ]; then
diff --git a/qa/L0_dyna_sequence_batcher/dyna_sequence_batcher_test.py b/qa/L0_dyna_sequence_batcher/dyna_sequence_batcher_test.py
old mode 100644
new mode 100755
index 6fff86948c..f2c709469b
--- a/qa/L0_dyna_sequence_batcher/dyna_sequence_batcher_test.py
+++ b/qa/L0_dyna_sequence_batcher/dyna_sequence_batcher_test.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -28,57 +30,55 @@
 
 sys.path.append("../common")
 
-from builtins import str
 import os
-import time
 import threading
+import time
 import unittest
+from builtins import str
+
 import numpy as np
-import test_util as tu
 import sequence_util as su
+import test_util as tu
 
-_test_system_shared_memory = bool(
-    int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0)))
-_test_cuda_shared_memory = bool(
-    int(os.environ.get('TEST_CUDA_SHARED_MEMORY', 0)))
+_test_system_shared_memory = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0)))
+_test_cuda_shared_memory = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0)))
 
-NO_BATCHING = (int(os.environ.get('NO_BATCHING', 0)) == 1)
+NO_BATCHING = int(os.environ.get("NO_BATCHING", 0)) == 1
 BACKENDS = os.environ.get(
-    'BACKENDS', "graphdef savedmodel libtorch onnx plan custom custom_string")
-IMPLICIT_STATE = (int(os.environ['IMPLICIT_STATE']) == 1)
+    "BACKENDS", "graphdef savedmodel libtorch onnx plan custom custom_string"
+)
+IMPLICIT_STATE = int(os.environ["IMPLICIT_STATE"]) == 1
 
-_trials = BACKENDS.split(' ')
+_trials = BACKENDS.split(" ")
 for backend in BACKENDS.split(" "):
     if NO_BATCHING:
-        if (backend != 'custom') and (backend != 'custom_string'):
+        if (backend != "custom") and (backend != "custom_string"):
             _trials += (backend + "_nobatch",)
 
 _ragged_batch_supported_trials = []
-if 'custom' in BACKENDS.split(' '):
-    _ragged_batch_supported_trials.append('custom')
+if "custom" in BACKENDS.split(" "):
+    _ragged_batch_supported_trials.append("custom")
 
 _protocols = ("http", "grpc")
 _max_sequence_idle_ms = 5000
 
 
 class DynaSequenceBatcherTest(su.SequenceBatcherTestUtil):
-
     def get_datatype(self, trial):
         return np.int32
 
-    def get_expected_result(self,
-                            expected_result,
-                            corrid,
-                            value,
-                            trial,
-                            flag_str=None):
+    def get_expected_result(self, expected_result, corrid, value, trial, flag_str=None):
         # Adjust the expected_result for models that
-        # couldn't implement the full accumulator. See
+        # could not implement the full accumulator. See
         # qa/common/gen_qa_dyna_sequence_models.py for more
         # information.
-        if ((("nobatch" not in trial) and ("custom" not in trial)) or \
-            ("graphdef" in trial) or ("plan" in trial) or ("onnx" in trial) or \
-            ("libtorch" in trial)):
+        if (
+            (("nobatch" not in trial) and ("custom" not in trial))
+            or ("graphdef" in trial)
+            or ("plan" in trial)
+            or ("onnx" in trial)
+            or ("libtorch" in trial)
+        ):
             expected_result = value
             if flag_str is not None:
                 if "start" in flag_str:
@@ -90,12 +90,9 @@ def get_expected_result(self,
                         expected_result += corrid
         return expected_result
 
-    def get_expected_result_implicit(self,
-                                     expected_result,
-                                     corrid,
-                                     value,
-                                     trial,
-                                     flag_str=None):
+    def get_expected_result_implicit(
+        self, expected_result, corrid, value, trial, flag_str=None
+    ):
         return expected_result
 
     def test_simple_sequence(self):
@@ -111,18 +108,22 @@ def test_simple_sequence(self):
 
                     self.check_setup(model_name)
                     self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                    self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                     os.environ)
+                    self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
 
                     if "string" in trial:
-                        corrid = '52'
+                        corrid = "52"
                     else:
                         corrid = 52
 
-                    expected_result = self.get_expected_result(
-                        45 + int(corrid), corrid, 9, trial, "end"
-                    ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                        45, corrid, 9, trial, "end")
+                    expected_result = (
+                        self.get_expected_result(
+                            45 + int(corrid), corrid, 9, trial, "end"
+                        )
+                        if not IMPLICIT_STATE
+                        else self.get_expected_result_implicit(
+                            45, corrid, 9, trial, "end"
+                        )
+                    )
 
                     self.check_sequence(
                         trial,
@@ -131,19 +132,26 @@ def test_simple_sequence(self):
                         corrid,
                         (4000, None),
                         # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay))
-                        (("start", 1, None, None), (None, 2, None, None),
-                         (None, 3, None, None), (None, 4, None, None),
-                         (None, 5, None, None), (None, 6, None, None),
-                         (None, 7, None, None), (None, 8, None, None),
-                         ("end", 9, None, None)),
+                        (
+                            ("start", 1, None, None),
+                            (None, 2, None, None),
+                            (None, 3, None, None),
+                            (None, 4, None, None),
+                            (None, 5, None, None),
+                            (None, 6, None, None),
+                            (None, 7, None, None),
+                            (None, 8, None, None),
+                            ("end", 9, None, None),
+                        ),
                         expected_result,
                         protocol,
-                        sequence_name="{}_{}".format(self._testMethodName,
-                                                     protocol))
+                        sequence_name="{}_{}".format(self._testMethodName, protocol),
+                    )
 
                     self.check_deferred_exception()
-                    self.check_status(model_name, {1: 9 * (idx + 1)},
-                                      9 * (idx + 1), 9 * (idx + 1))
+                    self.check_status(
+                        model_name, {1: 9 * (idx + 1)}, 9 * (idx + 1), 9 * (idx + 1)
+                    )
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -160,18 +168,22 @@ def test_length1_sequence(self):
 
                     self.check_setup(model_name)
                     self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                    self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                     os.environ)
+                    self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
 
                     if "string" in trial:
-                        corrid = '99'
+                        corrid = "99"
                     else:
                         corrid = 99
 
-                    expected_result = self.get_expected_result(
-                        42 + int(corrid), corrid, 42, trial, "start,end"
-                    ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                        42, corrid, 42, trial, "start,end")
+                    expected_result = (
+                        self.get_expected_result(
+                            42 + int(corrid), corrid, 42, trial, "start,end"
+                        )
+                        if not IMPLICIT_STATE
+                        else self.get_expected_result_implicit(
+                            42, corrid, 42, trial, "start,end"
+                        )
+                    )
 
                     self.check_sequence(
                         trial,
@@ -180,50 +192,60 @@ def test_length1_sequence(self):
                         corrid,
                         (4000, None),
                         # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay))
-                        (
-                            ("start,end", 42, None, None),),
+                        (("start,end", 42, None, None),),
                         expected_result,
                         protocol,
-                        sequence_name="{}_{}".format(self._testMethodName,
-                                                     protocol))
+                        sequence_name="{}_{}".format(self._testMethodName, protocol),
+                    )
 
                     self.check_deferred_exception()
-                    self.check_status(model_name, {1: (idx + 1)}, (idx + 1),
-                                      (idx + 1))
+                    self.check_status(model_name, {1: (idx + 1)}, (idx + 1), (idx + 1))
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
-    def _multi_sequence_impl(self, trials, expected_batch_exec,
-                             expected_exec_cnt, sleep_secs, tensor_shapes):
+    def _multi_sequence_impl(
+        self, trials, expected_batch_exec, expected_exec_cnt, sleep_secs, tensor_shapes
+    ):
         for trial in trials:
             self.clear_deferred_exceptions()
             dtype = self.get_datatype(trial)
             precreated_shm0_handles = self.precreate_register_regions(
-                (1, 3), dtype, 0, tensor_shape=(tensor_shapes[0],))
+                (1, 3), dtype, 0, tensor_shape=(tensor_shapes[0],)
+            )
             precreated_shm1_handles = self.precreate_register_regions(
-                (11, 12, 13), dtype, 1, tensor_shape=(tensor_shapes[1],))
+                (11, 12, 13), dtype, 1, tensor_shape=(tensor_shapes[1],)
+            )
             precreated_shm2_handles = self.precreate_register_regions(
-                (111, 112, 113), dtype, 2, tensor_shape=(tensor_shapes[2],))
+                (111, 112, 113), dtype, 2, tensor_shape=(tensor_shapes[2],)
+            )
             precreated_shm3_handles = self.precreate_register_regions(
-                (1111, 1112, 1113), dtype, 3, tensor_shape=(tensor_shapes[3],))
+                (1111, 1112, 1113), dtype, 3, tensor_shape=(tensor_shapes[3],)
+            )
             try:
                 model_name = tu.get_dyna_sequence_model_name(trial, dtype)
 
                 self.check_setup(model_name)
                 self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                 os.environ)
+                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
 
                 if "string" in trial:
-                    corrids = ['1001', '1002', '1003', '1004']
+                    corrids = ["1001", "1002", "1003", "1004"]
                 else:
                     corrids = [1001, 1002, 1003, 1004]
 
-                expected_result = self.get_expected_result(
-                    4 * tensor_shapes[0] +
-                    int(corrids[0]), corrids[0], 3, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    4, corrids[0], 3, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        4 * tensor_shapes[0] + int(corrids[0]),
+                        corrids[0],
+                        3,
+                        trial,
+                        "end",
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        4, corrids[0], 3, trial, "end"
+                    )
+                )
 
                 threads = []
                 threads.append(
@@ -238,19 +260,30 @@ def _multi_sequence_impl(self, trials, expected_batch_exec,
                             # (flag_str, value, pre_delay_ms)
                             (("start", 1, None), ("end", 3, None)),
                             expected_result,
-                            precreated_shm0_handles),
+                            precreated_shm0_handles,
+                        ),
                         kwargs={
-                            'sequence_name':
-                                "{}_{}".format(self._testMethodName,
-                                               corrids[0]),
-                            'tensor_shape': (tensor_shapes[0],)
-                        }))
+                            "sequence_name": "{}_{}".format(
+                                self._testMethodName, corrids[0]
+                            ),
+                            "tensor_shape": (tensor_shapes[0],),
+                        },
+                    )
+                )
 
-                expected_result = self.get_expected_result(
-                    36 * tensor_shapes[1] +
-                    int(corrids[1]), corrids[1], 13, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    36, corrids[1], 13, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        36 * tensor_shapes[1] + int(corrids[1]),
+                        corrids[1],
+                        13,
+                        trial,
+                        "end",
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        36, corrids[1], 13, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -261,22 +294,32 @@ def _multi_sequence_impl(self, trials, expected_batch_exec,
                             corrids[1],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 11, None), (None, 12, None), ("end", 13,
-                                                                     None)),
+                            (("start", 11, None), (None, 12, None), ("end", 13, None)),
                             expected_result,
-                            precreated_shm1_handles),
+                            precreated_shm1_handles,
+                        ),
                         kwargs={
-                            'sequence_name':
-                                "{}_{}".format(self._testMethodName,
-                                               corrids[1]),
-                            'tensor_shape': (tensor_shapes[1],)
-                        }))
+                            "sequence_name": "{}_{}".format(
+                                self._testMethodName, corrids[1]
+                            ),
+                            "tensor_shape": (tensor_shapes[1],),
+                        },
+                    )
+                )
 
-                expected_result = self.get_expected_result(
-                    336 * tensor_shapes[2] +
-                    int(corrids[2]), corrids[2], 113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    336, corrids[2], 113, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        336 * tensor_shapes[2] + int(corrids[2]),
+                        corrids[2],
+                        113,
+                        trial,
+                        "end",
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        336, corrids[2], 113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -287,21 +330,35 @@ def _multi_sequence_impl(self, trials, expected_batch_exec,
                             corrids[2],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 111, None), (None, 112, None),
-                             ("end", 113, None)),
+                            (
+                                ("start", 111, None),
+                                (None, 112, None),
+                                ("end", 113, None),
+                            ),
                             expected_result,
-                            precreated_shm2_handles),
+                            precreated_shm2_handles,
+                        ),
                         kwargs={
-                            'sequence_name':
-                                "{}_{}".format(self._testMethodName,
-                                               corrids[2]),
-                            'tensor_shape': (tensor_shapes[2],)
-                        }))
-                expected_result = self.get_expected_result(
-                    3336 * tensor_shapes[3] +
-                    int(corrids[3]), corrids[3], 1113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    3336, corrids[3], 1113, trial, "end")
+                            "sequence_name": "{}_{}".format(
+                                self._testMethodName, corrids[2]
+                            ),
+                            "tensor_shape": (tensor_shapes[2],),
+                        },
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        3336 * tensor_shapes[3] + int(corrids[3]),
+                        corrids[3],
+                        1113,
+                        trial,
+                        "end",
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        3336, corrids[3], 1113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -312,16 +369,22 @@ def _multi_sequence_impl(self, trials, expected_batch_exec,
                             corrids[3],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1111, None), (None, 1112, None),
-                             ("end", 1113, None)),
+                            (
+                                ("start", 1111, None),
+                                (None, 1112, None),
+                                ("end", 1113, None),
+                            ),
                             expected_result,
-                            precreated_shm3_handles),
+                            precreated_shm3_handles,
+                        ),
                         kwargs={
-                            'sequence_name':
-                                "{}_{}".format(self._testMethodName,
-                                               corrids[3]),
-                            'tensor_shape': (tensor_shapes[3],)
-                        }))
+                            "sequence_name": "{}_{}".format(
+                                self._testMethodName, corrids[3]
+                            ),
+                            "tensor_shape": (tensor_shapes[3],),
+                        },
+                    )
+                )
 
                 for t in threads:
                     t.start()
@@ -330,8 +393,9 @@ def _multi_sequence_impl(self, trials, expected_batch_exec,
                 for t in threads:
                     t.join()
                 self.check_deferred_exception()
-                self.check_status(model_name, expected_batch_exec,
-                                  expected_exec_cnt, 11)
+                self.check_status(
+                    model_name, expected_batch_exec, expected_exec_cnt, 11
+                )
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
             finally:
@@ -355,18 +419,18 @@ def test_multi_sequence_different_shape(self):
         # Send four sequences in parallel where the requests in each
         # sequence have different shape. Sequences should not be
         # batched due to input tensor size differences.
-        self._multi_sequence_impl(_ragged_batch_supported_trials, {1: 11}, 11,
-                                  0, (4, 3, 1, 2))
+        self._multi_sequence_impl(
+            _ragged_batch_supported_trials, {1: 11}, 11, 0, (4, 3, 1, 2)
+        )
 
     def test_multi_sequence_different_shape_allow_ragged(self):
         # Send four sequences in parallel where the requests in each
         # sequence have different shape. Input is marked as allowing
         # ragged and so sequences should be batched even with input
         # tensor size differences.
-        self._multi_sequence_impl(_ragged_batch_supported_trials, {
-            4: 2,
-            3: 1
-        }, 3, 1, (4, 3, 1, 2))
+        self._multi_sequence_impl(
+            _ragged_batch_supported_trials, {4: 2, 3: 1}, 3, 1, (4, 3, 1, 2)
+        )
 
     def test_backlog(self):
         # Send 5 equal-length sequences in parallel and make sure they
@@ -376,33 +440,42 @@ def test_backlog(self):
         for trial in _trials:
             self.clear_deferred_exceptions()
             dtype = self.get_datatype(trial)
-            precreated_shm0_handles = self.precreate_register_regions((1, 2, 3),
-                                                                      dtype, 0)
+            precreated_shm0_handles = self.precreate_register_regions(
+                (1, 2, 3), dtype, 0
+            )
             precreated_shm1_handles = self.precreate_register_regions(
-                (11, 12, 13), dtype, 1)
+                (11, 12, 13), dtype, 1
+            )
             precreated_shm2_handles = self.precreate_register_regions(
-                (111, 112, 113), dtype, 2)
+                (111, 112, 113), dtype, 2
+            )
             precreated_shm3_handles = self.precreate_register_regions(
-                (1111, 1112, 1113), dtype, 3)
+                (1111, 1112, 1113), dtype, 3
+            )
             precreated_shm4_handles = self.precreate_register_regions(
-                (11111, 11112, 11113), dtype, 4)
+                (11111, 11112, 11113), dtype, 4
+            )
             try:
                 model_name = tu.get_dyna_sequence_model_name(trial, dtype)
 
                 self.check_setup(model_name)
                 self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                 os.environ)
+                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
 
                 if "string" in trial:
-                    corrids = ['1001', '1002', '1003', '1004', '1005']
+                    corrids = ["1001", "1002", "1003", "1004", "1005"]
                 else:
                     corrids = [1001, 1002, 1003, 1004, 1005]
 
-                expected_result = self.get_expected_result(
-                    6 + int(corrids[0]), corrids[0], 3, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    6, corrids[0], 3, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        6 + int(corrids[0]), corrids[0], 3, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        6, corrids[0], 3, trial, "end"
+                    )
+                )
 
                 threads = []
                 threads.append(
@@ -415,18 +488,23 @@ def test_backlog(self):
                             corrids[0],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1, None), (None, 2, None), ("end", 3,
-                                                                   None)),
+                            (("start", 1, None), (None, 2, None), ("end", 3, None)),
                             expected_result,
-                            precreated_shm0_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm0_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
-                expected_result = self.get_expected_result(
-                    36 + int(corrids[1]), corrids[1], 13, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    36, corrids[1], 13, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        36 + int(corrids[1]), corrids[1], 13, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        36, corrids[1], 13, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -437,18 +515,23 @@ def test_backlog(self):
                             corrids[1],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 11, None), (None, 12, None), ("end", 13,
-                                                                     None)),
+                            (("start", 11, None), (None, 12, None), ("end", 13, None)),
                             expected_result,
-                            precreated_shm1_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm1_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
-                expected_result = self.get_expected_result(
-                    336 + int(corrids[2]), corrids[2], 113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    336, corrids[2], 113, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        336 + int(corrids[2]), corrids[2], 113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        336, corrids[2], 113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -459,18 +542,27 @@ def test_backlog(self):
                             corrids[2],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 111, None), (None, 112, None),
-                             ("end", 113, None)),
+                            (
+                                ("start", 111, None),
+                                (None, 112, None),
+                                ("end", 113, None),
+                            ),
                             expected_result,
-                            precreated_shm2_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm2_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
-                expected_result = self.get_expected_result(
-                    3336 + int(corrids[3]), corrids[3], 1113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    3336, corrids[3], 1113, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        3336 + int(corrids[3]), corrids[3], 1113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        3336, corrids[3], 1113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -481,18 +573,27 @@ def test_backlog(self):
                             corrids[3],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1111, None), (None, 1112, None),
-                             ("end", 1113, None)),
+                            (
+                                ("start", 1111, None),
+                                (None, 1112, None),
+                                ("end", 1113, None),
+                            ),
                             expected_result,
-                            precreated_shm3_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm3_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
-                expected_result = self.get_expected_result(
-                    33336 + int(corrids[4]), corrids[4], 11113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    33336, corrids[4], 11113, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        33336 + int(corrids[4]), corrids[4], 11113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        33336, corrids[4], 11113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -503,13 +604,17 @@ def test_backlog(self):
                             corrids[4],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 11111, None), (None, 11112, None),
-                             ("end", 11113, None)),
+                            (
+                                ("start", 11111, None),
+                                (None, 11112, None),
+                                ("end", 11113, None),
+                            ),
                             expected_result,
-                            precreated_shm4_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm4_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
                 for t in threads:
                     t.start()
@@ -534,35 +639,45 @@ def test_backlog_fill(self):
         for trial in _trials:
             self.clear_deferred_exceptions()
             dtype = self.get_datatype(trial)
-            precreated_shm0_handles = self.precreate_register_regions((1, 2, 3),
-                                                                      dtype, 0)
-            precreated_shm1_handles = self.precreate_register_regions((11, 13),
-                                                                      dtype, 1)
+            precreated_shm0_handles = self.precreate_register_regions(
+                (1, 2, 3), dtype, 0
+            )
+            precreated_shm1_handles = self.precreate_register_regions(
+                (11, 13), dtype, 1
+            )
             precreated_shm2_handles = self.precreate_register_regions(
-                (111, 113), dtype, 2)
+                (111, 113), dtype, 2
+            )
             precreated_shm3_handles = self.precreate_register_regions(
-                (1111, 1112, 1113), dtype, 3)
-            precreated_shm4_handles = self.precreate_register_regions((11111,),
-                                                                      dtype, 4)
-            precreated_shm5_handles = self.precreate_register_regions((22222,),
-                                                                      dtype, 5)
+                (1111, 1112, 1113), dtype, 3
+            )
+            precreated_shm4_handles = self.precreate_register_regions(
+                (11111,), dtype, 4
+            )
+            precreated_shm5_handles = self.precreate_register_regions(
+                (22222,), dtype, 5
+            )
             try:
                 model_name = tu.get_dyna_sequence_model_name(trial, dtype)
 
                 self.check_setup(model_name)
                 self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                 os.environ)
+                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
                 if "string" in trial:
-                    corrids = ['1001', '1002', '1003', '1004', '1005', '1006']
+                    corrids = ["1001", "1002", "1003", "1004", "1005", "1006"]
                 else:
                     corrids = [1001, 1002, 1003, 1004, 1005, 1006]
                 threads = []
 
-                expected_result = self.get_expected_result(
-                    6 + int(corrids[0]), corrids[0], 3, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    6, corrids[0], 3, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        6 + int(corrids[0]), corrids[0], 3, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        6, corrids[0], 3, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -573,17 +688,22 @@ def test_backlog_fill(self):
                             corrids[0],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1, None), (None, 2, None), ("end", 3,
-                                                                   None)),
+                            (("start", 1, None), (None, 2, None), ("end", 3, None)),
                             expected_result,
-                            precreated_shm0_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    24 + int(corrids[1]), corrids[1], 13, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    24, corrids[1], 13, trial, "end")
+                            precreated_shm0_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        24 + int(corrids[1]), corrids[1], 13, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        24, corrids[1], 13, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -596,14 +716,20 @@ def test_backlog_fill(self):
                             # (flag_str, value, pre_delay_ms)
                             (("start", 11, None), ("end", 13, None)),
                             expected_result,
-                            precreated_shm1_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    224 + int(corrids[2]), corrids[2], 113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    224, corrids[2], 113, trial, "end")
+                            precreated_shm1_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        224 + int(corrids[2]), corrids[2], 113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        224, corrids[2], 113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -616,14 +742,20 @@ def test_backlog_fill(self):
                             # (flag_str, value, pre_delay_ms)
                             (("start", 111, None), ("end", 113, None)),
                             expected_result,
-                            precreated_shm2_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    3336 + int(corrids[3]), corrids[3], 1113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    3336, corrids[3], 1113, trial, "end")
+                            precreated_shm2_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        3336 + int(corrids[3]), corrids[3], 1113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        3336, corrids[3], 1113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -634,18 +766,26 @@ def test_backlog_fill(self):
                             corrids[3],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1111, None), (None, 1112, 3000),
-                             ("end", 1113, None)),
+                            (
+                                ("start", 1111, None),
+                                (None, 1112, 3000),
+                                ("end", 1113, None),
+                            ),
                             expected_result,
-                            precreated_shm3_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    11111 +
-                    int(corrids[4]), corrids[4], 11111, trial, "start,end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    11111, corrids[4], 11111, trial, "start,end")
+                            precreated_shm3_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        11111 + int(corrids[4]), corrids[4], 11111, trial, "start,end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        11111, corrids[4], 11111, trial, "start,end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -656,18 +796,22 @@ def test_backlog_fill(self):
                             corrids[4],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (
-                                ("start,end", 11111, None),),
+                            (("start,end", 11111, None),),
                             expected_result,
-                            precreated_shm4_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    22222 +
-                    int(corrids[5]), corrids[5], 22222, trial, "start,end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    22222, corrids[5], 22222, trial, "start,end")
+                            precreated_shm4_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        22222 + int(corrids[5]), corrids[5], 22222, trial, "start,end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        22222, corrids[5], 22222, trial, "start,end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -678,13 +822,13 @@ def test_backlog_fill(self):
                             corrids[5],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (
-                                ("start,end", 22222, None),),
+                            (("start,end", 22222, None),),
                             expected_result,
-                            precreated_shm5_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm5_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
                 threads[0].start()
                 threads[1].start()
@@ -716,35 +860,45 @@ def test_backlog_fill_no_end(self):
         for trial in _trials:
             self.clear_deferred_exceptions()
             dtype = self.get_datatype(trial)
-            precreated_shm0_handles = self.precreate_register_regions((1, 2, 3),
-                                                                      dtype, 0)
-            precreated_shm1_handles = self.precreate_register_regions((11, 13),
-                                                                      dtype, 1)
+            precreated_shm0_handles = self.precreate_register_regions(
+                (1, 2, 3), dtype, 0
+            )
+            precreated_shm1_handles = self.precreate_register_regions(
+                (11, 13), dtype, 1
+            )
             precreated_shm2_handles = self.precreate_register_regions(
-                (111, 113), dtype, 2)
+                (111, 113), dtype, 2
+            )
             precreated_shm3_handles = self.precreate_register_regions(
-                (1111, 1112, 1113), dtype, 3)
-            precreated_shm4_handles = self.precreate_register_regions((11111,),
-                                                                      dtype, 4)
+                (1111, 1112, 1113), dtype, 3
+            )
+            precreated_shm4_handles = self.precreate_register_regions(
+                (11111,), dtype, 4
+            )
             precreated_shm5_handles = self.precreate_register_regions(
-                (22222, 22223, 22224), dtype, 5)
+                (22222, 22223, 22224), dtype, 5
+            )
             try:
                 model_name = tu.get_dyna_sequence_model_name(trial, dtype)
 
                 self.check_setup(model_name)
                 self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                 os.environ)
+                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
 
                 if "string" in trial:
-                    corrids = ['1001', '1002', '1003', '1004', '1005', '1006']
+                    corrids = ["1001", "1002", "1003", "1004", "1005", "1006"]
                 else:
                     corrids = [1001, 1002, 1003, 1004, 1005, 1006]
                 threads = []
-                expected_result = self.get_expected_result(
-                    6 + int(corrids[0]), corrids[0], 3, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    6, corrids[0], 3, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        6 + int(corrids[0]), corrids[0], 3, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        6, corrids[0], 3, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -755,17 +909,22 @@ def test_backlog_fill_no_end(self):
                             corrids[0],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1, None), (None, 2, None), ("end", 3,
-                                                                   None)),
+                            (("start", 1, None), (None, 2, None), ("end", 3, None)),
                             expected_result,
-                            precreated_shm0_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    24 + int(corrids[1]), corrids[1], 13, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    24, corrids[1], 13, trial, "end")
+                            precreated_shm0_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        24 + int(corrids[1]), corrids[1], 13, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        24, corrids[1], 13, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -778,14 +937,20 @@ def test_backlog_fill_no_end(self):
                             # (flag_str, value, pre_delay_ms)
                             (("start", 11, None), ("end", 13, None)),
                             expected_result,
-                            precreated_shm1_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    224 + int(corrids[2]), corrids[2], 113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    224, corrids[2], 113, trial, "end")
+                            precreated_shm1_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        224 + int(corrids[2]), corrids[2], 113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        224, corrids[2], 113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -798,14 +963,20 @@ def test_backlog_fill_no_end(self):
                             # (flag_str, value, pre_delay_ms)
                             (("start", 111, None), ("end", 113, None)),
                             expected_result,
-                            precreated_shm2_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    3336 + int(corrids[3]), corrids[3], 1113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    3336, corrids[3], 1113, trial, "end")
+                            precreated_shm2_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        3336 + int(corrids[3]), corrids[3], 1113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        3336, corrids[3], 1113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -816,18 +987,26 @@ def test_backlog_fill_no_end(self):
                             corrids[3],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1111, None), (None, 1112, 3000),
-                             ("end", 1113, None)),
+                            (
+                                ("start", 1111, None),
+                                (None, 1112, 3000),
+                                ("end", 1113, None),
+                            ),
                             expected_result,
-                            precreated_shm3_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    11111 +
-                    int(corrids[4]), corrids[4], 11111, trial, "start,end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    11111, corrids[4], 11111, trial, "start,end")
+                            precreated_shm3_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        11111 + int(corrids[4]), corrids[4], 11111, trial, "start,end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        11111, corrids[4], 11111, trial, "start,end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -838,17 +1017,22 @@ def test_backlog_fill_no_end(self):
                             corrids[4],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (
-                                ("start,end", 11111, None),),
+                            (("start,end", 11111, None),),
                             expected_result,
-                            precreated_shm4_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    66669 + int(corrids[5]), corrids[5], 22224, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    66669, corrids[5], 22224, trial, "end")
+                            precreated_shm4_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        66669 + int(corrids[5]), corrids[5], 22224, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        66669, corrids[5], 22224, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -865,10 +1049,11 @@ def test_backlog_fill_no_end(self):
                                 ("end", 22224, 2000),
                             ),
                             expected_result,
-                            precreated_shm5_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm5_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
                 threads[0].start()
                 threads[1].start()
@@ -906,33 +1091,40 @@ def test_backlog_sequence_timeout(self):
         for trial in _trials:
             self.clear_deferred_exceptions()
             dtype = self.get_datatype(trial)
-            precreated_shm0_handles = self.precreate_register_regions((1, 3),
-                                                                      dtype, 0)
+            precreated_shm0_handles = self.precreate_register_regions((1, 3), dtype, 0)
             precreated_shm1_handles = self.precreate_register_regions(
-                (11, 12, 12, 13), dtype, 1)
+                (11, 12, 12, 13), dtype, 1
+            )
             precreated_shm2_handles = self.precreate_register_regions(
-                (111, 112, 112, 113), dtype, 2)
+                (111, 112, 112, 113), dtype, 2
+            )
             precreated_shm3_handles = self.precreate_register_regions(
-                (1111, 1112, 1112, 1113), dtype, 3)
+                (1111, 1112, 1112, 1113), dtype, 3
+            )
             precreated_shm4_handles = self.precreate_register_regions(
-                (11111, 11113), dtype, 4)
+                (11111, 11113), dtype, 4
+            )
             try:
                 model_name = tu.get_dyna_sequence_model_name(trial, dtype)
 
                 self.check_setup(model_name)
                 self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                 os.environ)
+                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
 
                 if "string" in trial:
-                    corrids = ['1001', '1002', '1003', '1004', '1005']
+                    corrids = ["1001", "1002", "1003", "1004", "1005"]
                 else:
                     corrids = [1001, 1002, 1003, 1004, 1005]
                 threads = []
-                expected_result = self.get_expected_result(
-                    4 + int(corrids[0]), corrids[0], 3, trial, None
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    4, corrids[0], 3, trial, None)
+                expected_result = (
+                    self.get_expected_result(
+                        4 + int(corrids[0]), corrids[0], 3, trial, None
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        4, corrids[0], 3, trial, None
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -943,17 +1135,25 @@ def test_backlog_sequence_timeout(self):
                             corrids[0],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1, None),
-                             (None, 3, _max_sequence_idle_ms + 1000)),
+                            (
+                                ("start", 1, None),
+                                (None, 3, _max_sequence_idle_ms + 1000),
+                            ),
                             expected_result,
-                            precreated_shm0_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    48 + int(corrids[1]), corrids[1], 13, trial, None
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    48, corrids[1], 13, trial, None)
+                            precreated_shm0_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        48 + int(corrids[1]), corrids[1], 13, trial, None
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        48, corrids[1], 13, trial, None
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -964,19 +1164,27 @@ def test_backlog_sequence_timeout(self):
                             corrids[1],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 11, None), (None, 12,
-                                                   _max_sequence_idle_ms / 2),
-                             (None, 12, _max_sequence_idle_ms / 2),
-                             ("end", 13, _max_sequence_idle_ms / 2)),
+                            (
+                                ("start", 11, None),
+                                (None, 12, _max_sequence_idle_ms / 2),
+                                (None, 12, _max_sequence_idle_ms / 2),
+                                ("end", 13, _max_sequence_idle_ms / 2),
+                            ),
                             expected_result,
-                            precreated_shm1_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    448 + int(corrids[2]), corrids[2], 113, trial, None
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    448, corrids[2], 113, trial, None)
+                            precreated_shm1_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        448 + int(corrids[2]), corrids[2], 113, trial, None
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        448, corrids[2], 113, trial, None
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -987,19 +1195,27 @@ def test_backlog_sequence_timeout(self):
                             corrids[2],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 111, None), (None, 112,
-                                                    _max_sequence_idle_ms / 2),
-                             (None, 112, _max_sequence_idle_ms / 2),
-                             ("end", 113, _max_sequence_idle_ms / 2)),
+                            (
+                                ("start", 111, None),
+                                (None, 112, _max_sequence_idle_ms / 2),
+                                (None, 112, _max_sequence_idle_ms / 2),
+                                ("end", 113, _max_sequence_idle_ms / 2),
+                            ),
                             expected_result,
-                            precreated_shm2_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    4448 + int(corrids[3]), corrids[3], 1113, trial, None
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    4448, corrids[3], 1113, trial, None)
+                            precreated_shm2_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        4448 + int(corrids[3]), corrids[3], 1113, trial, None
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        4448, corrids[3], 1113, trial, None
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -1010,19 +1226,27 @@ def test_backlog_sequence_timeout(self):
                             corrids[3],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1111, None), (None, 1112,
-                                                     _max_sequence_idle_ms / 2),
-                             (None, 1112, _max_sequence_idle_ms / 2),
-                             ("end", 1113, _max_sequence_idle_ms / 2)),
+                            (
+                                ("start", 1111, None),
+                                (None, 1112, _max_sequence_idle_ms / 2),
+                                (None, 1112, _max_sequence_idle_ms / 2),
+                                ("end", 1113, _max_sequence_idle_ms / 2),
+                            ),
                             expected_result,
-                            precreated_shm3_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    22224 + int(corrids[4]), corrids[4], 11113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    22224, corrids[4], 11113, trial, "end")
+                            precreated_shm3_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        22224 + int(corrids[4]), corrids[4], 11113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        22224, corrids[4], 11113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -1035,10 +1259,11 @@ def test_backlog_sequence_timeout(self):
                             # (flag_str, value, pre_delay_ms)
                             (("start", 11111, None), ("end", 11113, None)),
                             expected_result,
-                            precreated_shm4_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm4_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
                 threads[0].start()
                 threads[1].start()
@@ -1052,10 +1277,15 @@ def test_backlog_sequence_timeout(self):
                 self.check_deferred_exception()
                 self.assertTrue(False, "expected error")
             except Exception as ex:
-                self.assertTrue(ex.message().startswith(
-                    str("inference request for sequence 1001 to " +
-                        "model '{}' must specify the START flag on the first " +
-                        "request of the sequence").format(model_name)))
+                self.assertTrue(
+                    ex.message().startswith(
+                        str(
+                            "inference request for sequence 1001 to "
+                            + "model '{}' must specify the START flag on the first "
+                            + "request of the sequence"
+                        ).format(model_name)
+                    )
+                )
             finally:
                 if _test_system_shared_memory or _test_cuda_shared_memory:
                     self.cleanup_shm_regions(precreated_shm0_handles)
@@ -1065,5 +1295,5 @@ def test_backlog_sequence_timeout(self):
                     self.cleanup_shm_regions(precreated_shm4_handles)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_dyna_sequence_batcher/test.sh b/qa/L0_dyna_sequence_batcher/test.sh
index 8c62e3f630..acac8399af 100755
--- a/qa/L0_dyna_sequence_batcher/test.sh
+++ b/qa/L0_dyna_sequence_batcher/test.sh
@@ -65,15 +65,20 @@ fi
 
 RET=0
 
-rm -fr *.log *.serverlog
+rm -fr *.log
 
 # models
 rm -fr models && mkdir models
-cp -r ${DATADIR}/$MODEL_REPOSITORY/* models/.
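+# Copy each model and set its instance count to 1 ("count: 1" is inserted
+# after "kind: KIND_CPU" in config.pbtxt).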
+for MODEL in ${DATADIR}/$MODEL_REPOSITORY/* ; do
+    cp -r $MODEL models/. && \
+        (cd models/$(basename $MODEL) && \
+            sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" config.pbtxt)
+done
 
 # Implicit state models for custom backend do not exist.
 if [ $IMPLICIT_STATE == "0" ]; then
     cp -r ../custom_models/custom_dyna_sequence_int32 models/.
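+    # Also set the instance count to 1 for the copied custom backend model.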
+    sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" models/custom_dyna_sequence_int32/config.pbtxt
     # Construct custom dyna_sequence_model with STRING sequence ID. Copy model and edit config.pbtxt
     cp -r models/custom_dyna_sequence_int32 models/custom_string_dyna_sequence_int32
     sed -i "s/custom_dyna_sequence_int32/custom_string_dyna_sequence_int32/g" models/custom_string_dyna_sequence_int32/config.pbtxt
@@ -86,6 +91,7 @@ if [ $IMPLICIT_STATE == "0" ]; then
     rm -fr ragged_models && mkdir ragged_models
     cp -r ../custom_models/custom_dyna_sequence_int32 ragged_models/.
     (cd ragged_models/custom_dyna_sequence_int32 && \
+            sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" config.pbtxt && \
             sed -i "s/name:.*\"INPUT\"/name: \"INPUT\"\\nallow_ragged_batch: true/" config.pbtxt)
 fi
 
@@ -98,7 +104,7 @@ for i in \
         test_simple_sequence \
         test_length1_sequence \
          ; do
-    SERVER_LOG="./$i.serverlog"
+    SERVER_LOG="./$i.server.log"
     SERVER_ARGS="--model-repository=`pwd`/models"
     run_server
     if [ "$SERVER_PID" == "0" ]; then
@@ -141,7 +147,7 @@ for i in \
         test_backlog_sequence_timeout \
     ; do
 
-    SERVER_LOG="./$i.serverlog"
+    SERVER_LOG="./$i.server.log"
     SERVER_ARGS="--model-repository=`pwd`/models"
     run_server
     if [ "$SERVER_PID" == "0" ]; then
@@ -180,7 +186,7 @@ if [ $IMPLICIT_STATE == "0" ]; then
         test_multi_sequence_different_shape_allow_ragged \
         ; do
 
-        SERVER_LOG="./$i.serverlog"
+        SERVER_LOG="./$i.server.log"
         SERVER_ARGS="--model-repository=`pwd`/ragged_models"
         run_server
         if [ "$SERVER_PID" == "0" ]; then
diff --git a/qa/L0_grpc/client_plugin_models/client_plugin_test/1/model.py b/qa/L0_grpc/client_plugin_models/client_plugin_test/1/model.py
new file mode 100644
index 0000000000..17c406b18e
--- /dev/null
+++ b/qa/L0_grpc/client_plugin_models/client_plugin_test/1/model.py
@@ -0,0 +1,63 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import json
+
+import numpy as np
+import triton_python_backend_utils as pb_utils
+
+
+class TritonPythonModel:
+    def execute(self, requests):
+        responses = []
+
+        for request in requests:
+            json_string = (
+                pb_utils.get_input_tensor_by_name(request, "EXPECTED_HEADERS")
+                .as_numpy()[0]
+                .decode("utf-8")
+            )
+            expected_headers = json.loads(json_string)
+
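+            # Headers forwarded by the gRPC frontend are exposed to the model
+            # as request parameters; verify that every expected header arrived
+            # with the expected value.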
+            success = True
+            if request.parameters() != "":
+                parameters = json.loads(request.parameters())
+                for key, value in expected_headers.items():
+                    if key in parameters:
+                        if parameters[key] != value:
+                            success = False
+                    else:
+                        success = False
+
+            test_success = pb_utils.Tensor(
+                "TEST_SUCCESS", np.array([success], dtype=bool)
+            )
+            inference_response = pb_utils.InferenceResponse(
+                output_tensors=[test_success]
+            )
+            responses.append(inference_response)
+
+        return responses
diff --git a/docs/model_analyzer.md b/qa/L0_grpc/client_plugin_models/client_plugin_test/config.pbtxt
similarity index 63%
rename from docs/model_analyzer.md
rename to qa/L0_grpc/client_plugin_models/client_plugin_test/config.pbtxt
index 4f442e55cc..1bf368f795 100644
--- a/docs/model_analyzer.md
+++ b/qa/L0_grpc/client_plugin_models/client_plugin_test/config.pbtxt
@@ -1,4 +1,3 @@
-
 
-# Model Analyzer
+name: "client_plugin_test"
+backend: "python"
 
-The Triton Model Analyzer is a tool that uses [Performance
-Analyzer](perf_analyzer.md) to send requests to your model while
-measuring GPU memory and compute utilization. The Model Analyzer is
-specifically useful for characterizing the GPU memory requirements for
-your model under different batching and model instance
-configurations. Once you have this GPU memory usage information you
-can more intelligently decide on how to combine multiple models on the
-same GPU while remaining within the memory capacity of the GPU.
+input [
+  {
+    name: "EXPECTED_HEADERS"
+    data_type: TYPE_STRING
+    dims: [ 1 ]
+  }
+]
+output [
+  {
+    name: "TEST_SUCCESS"
+    data_type: TYPE_BOOL
+    dims: [ 1 ]
+  }
+]
 
-For more information see the [Model Analyzer
-repository](https://github.com/triton-inference-server/model_analyzer)
-and the detailed explanation provided in [Maximizing Deep Learning
-Inference Performance with NVIDIA Model
-Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer).
+instance_group [{ kind: KIND_CPU }]
diff --git a/qa/L0_grpc/grpc_basic_auth_test.py b/qa/L0_grpc/grpc_basic_auth_test.py
new file mode 100755
index 0000000000..07d29ef5b7
--- /dev/null
+++ b/qa/L0_grpc/grpc_basic_auth_test.py
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+import sys
+import unittest
+
+sys.path.append("../common")
+
+import test_util as tu
+import tritonclient.grpc as tritongrpcclient
+import tritonclient.grpc.aio as asynctritongrpcclient
+from tritonclient.grpc.aio.auth import BasicAuth as AsyncBasicAuth
+from tritonclient.grpc.auth import BasicAuth
+
+
+class GRPCBasicAuthTest(tu.TestResultCollector):
+    def setUp(self):
+        # Use the nginx port
+        self._client = tritongrpcclient.InferenceServerClient(url="localhost:8004")
+        self._client.register_plugin(BasicAuth("username", "password"))
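+        # The BasicAuth plugin attaches the credentials to every request so it
+        # can pass the nginx auth_basic check in front of the gRPC endpoint.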
+
+    def test_client_call(self):
+        self.assertTrue(self._client.is_server_live())
+
+    def tearDown(self):
+        self._client.close()
+
+
+class GRPCBasicAuthAsyncTest(unittest.IsolatedAsyncioTestCase):
+    async def asyncSetUp(self):
+        # Use the nginx port
+        self._client = asynctritongrpcclient.InferenceServerClient(url="localhost:8004")
+        self._client.register_plugin(AsyncBasicAuth("username", "password"))
+
+    async def test_client_call(self):
+        self.assertTrue(await self._client.is_server_live())
+
+    async def asyncTearDown(self):
+        await self._client.close()
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_grpc/grpc_client_plugin_test.py b/qa/L0_grpc/grpc_client_plugin_test.py
new file mode 100755
index 0000000000..1cc8c474ef
--- /dev/null
+++ b/qa/L0_grpc/grpc_client_plugin_test.py
@@ -0,0 +1,120 @@
+#!/usr/bin/python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+import json
+import sys
+
+sys.path.append("../common")
+
+import unittest
+
+import numpy as np
+import test_util as tu
+import tritonclient.grpc as tritongrpcclient
+import tritonclient.grpc.aio as asynctritongrpcclient
+from tritonclient.grpc import InferenceServerClientPlugin
+from tritonclient.utils import np_to_triton_dtype
+
+
+# A simple plugin that adds headers to the inference request.
+class TestPlugin(InferenceServerClientPlugin):
+    def __init__(self, headers):
+        self._headers = headers
+
+    def __call__(self, request):
+        request.headers.update(self._headers)
+
+
+def prepare_infer_inputs(headers):
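+    # Pack the expected headers as a JSON string into the TYPE_STRING
+    # EXPECTED_HEADERS input consumed by the client_plugin_test model.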
+    expected_headers = np.array([json.dumps(headers)], dtype=object)
+    inputs = []
+    inputs.append(
+        tritongrpcclient.InferInput(
+            "EXPECTED_HEADERS",
+            expected_headers.shape,
+            np_to_triton_dtype(expected_headers.dtype),
+        )
+    )
+    inputs[0].set_data_from_numpy(expected_headers)
+
+    return inputs
+
+
+class GRPCClientPluginAsyncTest(unittest.IsolatedAsyncioTestCase):
+    async def asyncSetUp(self):
+        self._headers = {"my-key": "my-value"}
+        self._plugin = TestPlugin(self._headers)
+        self._client = asynctritongrpcclient.InferenceServerClient(url="localhost:8001")
+
+    async def test_simple_infer(self):
+        model = "client_plugin_test"
+        inputs = prepare_infer_inputs(self._headers)
+        self._client.register_plugin(self._plugin)
+        response = await self._client.infer(model_name=model, inputs=inputs)
+        test_success = response.as_numpy("TEST_SUCCESS")
+        self.assertEqual(test_success, True)
+
+        self._client.unregister_plugin()
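+        # With the plugin unregistered no extra headers are injected; an empty
+        # expected-header set keeps the model check trivially successful.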
+        inputs = prepare_infer_inputs({})
+        response = await self._client.infer(model_name=model, inputs=inputs)
+        test_success = response.as_numpy("TEST_SUCCESS")
+        self.assertEqual(test_success, True)
+
+    async def asyncTearDown(self):
+        await self._client.close()
+
+
+class GRPCClientPluginTest(tu.TestResultCollector):
+    def setUp(self):
+        self._headers = {"my-key": "my-value"}
+        self._plugin = TestPlugin(self._headers)
+        self._client = tritongrpcclient.InferenceServerClient(url="localhost:8001")
+
+    def test_simple_infer(self):
+        # Set the binary data to False so that 'Inference-Header-Length' is not
+        # added to the headers.
+        model = "client_plugin_test"
+        inputs = prepare_infer_inputs(self._headers)
+        self._client.register_plugin(self._plugin)
+        self.assertEqual(self._plugin, self._client.plugin())
+        response = self._client.infer(model_name=model, inputs=inputs)
+        test_success = response.as_numpy("TEST_SUCCESS")
+        self.assertEqual(test_success, True)
+
+        # Unregister the plugin
+        inputs = prepare_infer_inputs({})
+        self._client.unregister_plugin()
+        self.assertEqual(None, self._client.plugin())
+        response = self._client.infer(model_name=model, inputs=inputs)
+        test_success = response.as_numpy("TEST_SUCCESS")
+        self.assertEqual(test_success, True)
+
+    def tearDown(self):
+        self._client.close()
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_grpc/nginx.conf b/qa/L0_grpc/nginx.conf
new file mode 100644
index 0000000000..063d358c21
--- /dev/null
+++ b/qa/L0_grpc/nginx.conf
@@ -0,0 +1,54 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+worker_processes  1;
+
+error_log  /var/log/nginx/error.log;
+
+events {
+    worker_connections  1024;
+}
+
+http {
+    # Configure basic authentication
+    auth_basic "Restricted Content";
+    auth_basic_user_file /opt/tritonserver/qa/L0_grpc/pswd;
+
+    # Define upstream server
+    upstream backend {
+        server localhost:8001;
+    }
+
+    # Define server block for reverse proxy
+    server {
+        listen 8004 http2;
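+        # gRPC requires HTTP/2; authenticated traffic on 8004 is proxied to
+        # Triton's gRPC endpoint defined above.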
+
+        # Configure location for reverse proxy
+        location / {
+            grpc_pass grpc://backend;
+        }
+    }
+}
diff --git a/qa/L0_grpc/python_grpc_aio_test.py b/qa/L0_grpc/python_grpc_aio_test.py
new file mode 100755
index 0000000000..f342f19ad5
--- /dev/null
+++ b/qa/L0_grpc/python_grpc_aio_test.py
@@ -0,0 +1,125 @@
+#!/usr/bin/env python
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import unittest
+
+import tritonclient.grpc.aio as grpcclient
+from tritonclient.utils import *
+
+
+class TestGrpcAioClient(unittest.IsolatedAsyncioTestCase):
+    """Test if aio rpc can reach the server"""
+
+    def setUp(self):
+        self._triton_client = grpcclient.InferenceServerClient(url="localhost:8001")
+
+    async def asyncTearDown(self):
+        await self._triton_client.close()
+
+    async def test_is_server_live(self):
+        ret = await self._triton_client.is_server_live()
+        self.assertEqual(ret, True)
+
+    async def test_is_server_ready(self):
+        ret = await self._triton_client.is_server_ready()
+        self.assertEqual(ret, True)
+
+    async def test_is_model_ready(self):
+        ret = await self._triton_client.is_model_ready("simple")
+        self.assertEqual(ret, True)
+
+    async def test_get_server_metadata(self):
+        ret = await self._triton_client.get_server_metadata()
+        self.assertEqual(ret.name, "triton")
+
+        ret = await self._triton_client.get_server_metadata(as_json=True)
+        self.assertEqual(ret["name"], "triton")
+
+    async def test_get_model_metadata(self):
+        ret = await self._triton_client.get_model_metadata("simple")
+        self.assertEqual(ret.name, "simple")
+
+    async def test_get_model_config(self):
+        ret = await self._triton_client.get_model_config("simple")
+        self.assertEqual(ret.config.name, "simple")
+
+    async def test_get_model_repository_index(self):
+        ret = await self._triton_client.get_model_repository_index()
+        self.assertEqual(len(ret.models), 8)
+
+    async def test_load_model(self):
+        with self.assertRaisesRegex(
+            InferenceServerException,
+            "\[StatusCode\.UNAVAILABLE\] explicit model load / unload is not allowed if polling is enabled",
+        ):
+            await self._triton_client.load_model("simple")
+
+    async def test_unload_model(self):
+        with self.assertRaisesRegex(
+            InferenceServerException,
+            r"\[StatusCode\.UNAVAILABLE\] explicit model load / unload is not allowed if polling is enabled",
+        ):
+            await self._triton_client.unload_model("simple")
+
+    async def test_get_inference_statistics(self):
+        await self._triton_client.get_inference_statistics()
+
+    async def test_update_trace_settings(self):
+        await self._triton_client.update_trace_settings()
+
+    async def test_get_trace_settings(self):
+        await self._triton_client.get_trace_settings()
+
+    async def test_get_system_shared_memory_status(self):
+        await self._triton_client.get_system_shared_memory_status()
+
+    async def test_register_system_shared_memory(self):
+        with self.assertRaisesRegex(
+            InferenceServerException,
+            "\[StatusCode\.INTERNAL\] Unable to open shared memory region: ''",
+        ):
+            await self._triton_client.register_system_shared_memory("", "", 0)
+
+    async def test_unregister_system_shared_memory(self):
+        await self._triton_client.unregister_system_shared_memory()
+
+    async def test_get_cuda_shared_memory_status(self):
+        await self._triton_client.get_cuda_shared_memory_status()
+
+    async def test_register_cuda_shared_memory(self):
+        with self.assertRaisesRegex(
+            InferenceServerException,
+            "\[StatusCode\.INVALID_ARGUMENT\] failed to register CUDA shared memory region '': failed to open CUDA IPC handle: invalid argument",
+        ):
+            await self._triton_client.register_cuda_shared_memory("", b"", 0, 0)
+
+    async def test_unregister_cuda_shared_memory(self):
+        await self._triton_client.unregister_cuda_shared_memory()
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_grpc/python_unit_test.py b/qa/L0_grpc/python_unit_test.py
new file mode 100755
index 0000000000..9591d4274c
--- /dev/null
+++ b/qa/L0_grpc/python_unit_test.py
@@ -0,0 +1,159 @@
+#!/usr/bin/env python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import queue
+import time
+import unittest
+
+# For stream infer test
+from functools import partial
+
+import numpy as np
+import tritonclient.grpc as grpcclient
+from tritonclient.utils import InferenceServerException
+
+
+class UserData:
+    def __init__(self):
+        self._completed_requests = queue.Queue()
+
+
+def callback(user_data, result, error):
+    if error:
+        user_data._completed_requests.put(error)
+    else:
+        user_data._completed_requests.put(result)
+
+
+class RestrictedProtocolTest(unittest.TestCase):
+    def setUp(self):
+        self.client_ = grpcclient.InferenceServerClient(url="localhost:8001")
+        self.model_name_ = "simple"
+        self.prefix_ = "triton-grpc-protocol-"
+
+    # Other unspecified protocols should not be restricted
+    def test_sanity(self):
+        self.client_.get_inference_statistics("simple")
+        self.client_.get_inference_statistics(
+            "simple", headers={self.prefix_ + "infer-key": "infer-value"}
+        )
+
+    # The health, infer, and model repository protocols are restricted.
+    # health and infer expect the "triton-grpc-protocol-infer-key : infer-value" header,
+    # model repository expects "triton-grpc-protocol-admin-key : admin-value".
+    def test_model_repository(self):
+        with self.assertRaisesRegex(
+            InferenceServerException, "This protocol is restricted"
+        ):
+            self.client_.unload_model(
+                self.model_name_, headers={self.prefix_ + "infer-key": "infer-value"}
+            )
+        # Request goes through and hits the actual transaction error
+        with self.assertRaisesRegex(
+            InferenceServerException, "explicit model load / unload is not allowed"
+        ):
+            self.client_.unload_model(
+                self.model_name_, headers={self.prefix_ + "admin-key": "admin-value"}
+            )
+
+    def test_health(self):
+        with self.assertRaisesRegex(
+            InferenceServerException, "This protocol is restricted"
+        ):
+            self.client_.is_server_live()
+        self.client_.is_server_live({self.prefix_ + "infer-key": "infer-value"})
+
+    def test_infer(self):
+        # setup
+        inputs = [
+            grpcclient.InferInput("INPUT0", [1, 16], "INT32"),
+            grpcclient.InferInput("INPUT1", [1, 16], "INT32"),
+        ]
+        inputs[0].set_data_from_numpy(np.ones(shape=(1, 16), dtype=np.int32))
+        inputs[1].set_data_from_numpy(np.ones(shape=(1, 16), dtype=np.int32))
+
+        # This test only cares whether the request goes through
+        with self.assertRaisesRegex(
+            InferenceServerException, "This protocol is restricted"
+        ):
+            _ = self.client_.infer(
+                model_name=self.model_name_, inputs=inputs, headers={"test": "1"}
+            )
+        self.client_.infer(
+            model_name=self.model_name_,
+            inputs=inputs,
+            headers={self.prefix_ + "infer-key": "infer-value"},
+        )
+
+    def test_stream_infer(self):
+        # setup
+        inputs = [
+            grpcclient.InferInput("INPUT0", [1, 16], "INT32"),
+            grpcclient.InferInput("INPUT1", [1, 16], "INT32"),
+        ]
+        inputs[0].set_data_from_numpy(np.ones(shape=(1, 16), dtype=np.int32))
+        inputs[1].set_data_from_numpy(np.ones(shape=(1, 16), dtype=np.int32))
+        user_data = UserData()
+        # The server has no say in whether gRPC creates the stream; it is only
+        # notified after the stream is established, and only then can it access
+        # the metadata to decide whether to continue the stream.
+        # The client therefore always perceives the stream as successfully
+        # created and can only check its health at a later time.
+        self.client_.start_stream(partial(callback, user_data), headers={"test": "1"})
+        # wait for sufficient round-trip time
+        time.sleep(1)
+        with self.assertRaisesRegex(
+            InferenceServerException, "The stream is no longer in valid state"
+        ):
+            self.client_.async_stream_infer(model_name=self.model_name_, inputs=inputs)
+        # callback should record error detail
+        self.assertFalse(user_data._completed_requests.empty())
+        with self.assertRaisesRegex(
+            InferenceServerException, "This protocol is restricted"
+        ):
+            raise user_data._completed_requests.get()
+
+        self.assertTrue(user_data._completed_requests.empty())
+
+        # Stop and start new stream with proper header
+        self.client_.stop_stream()
+        self.client_.start_stream(
+            partial(callback, user_data),
+            headers={self.prefix_ + "infer-key": "infer-value"},
+        )
+        self.client_.async_stream_infer(model_name=self.model_name_, inputs=inputs)
+        # wait for response
+        time.sleep(1)
+        self.assertFalse(user_data._completed_requests.empty())
+        self.assertNotEqual(
+            type(user_data._completed_requests.get()), InferenceServerException
+        )
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_grpc/test.sh b/qa/L0_grpc/test.sh
old mode 100644
new mode 100755
index 70eb3bf561..73b9710a71
--- a/qa/L0_grpc/test.sh
+++ b/qa/L0_grpc/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -42,16 +42,22 @@ export CUDA_VISIBLE_DEVICES=0
 
 RET=0
 
+CLIENT_PLUGIN_TEST="./grpc_client_plugin_test.py"
+BASIC_AUTH_TEST="./grpc_basic_auth_test.py"
+NGINX_CONF="./nginx.conf"
 # On windows the paths invoked by the script (running in WSL) must use
 # /mnt/c when needed but the paths on the tritonserver command-line
 # must be C:/ style.
 if [[ "$(< /proc/sys/kernel/osrelease)" == *microsoft* ]]; then
     SDKDIR=${SDKDIR:=C:/sdk}
     MODELDIR=${MODELDIR:=C:/models}
+    CLIENT_PLUGIN_MODELDIR=${CLIENT_PLUGIN_MODELDIR:=C:/client_plugin_models}
     DATADIR=${DATADIR:="/mnt/c/data/inferenceserver/${REPO_VERSION}"}
     BACKEND_DIR=${BACKEND_DIR:=C:/tritonserver/backends}
     SERVER=${SERVER:=/mnt/c/tritonserver/bin/tritonserver.exe}
 
+    SIMPLE_AIO_INFER_CLIENT_PY=${SDKDIR}/python/simple_grpc_aio_infer_client.py
+    SIMPLE_AIO_STREAM_INFER_CLIENT_PY=${SDKDIR}/python/simple_grpc_aio_sequence_stream_infer_client.py
     SIMPLE_HEALTH_CLIENT_PY=${SDKDIR}/python/simple_grpc_health_metadata.py
     SIMPLE_INFER_CLIENT_PY=${SDKDIR}/python/simple_grpc_infer_client.py
     SIMPLE_ASYNC_INFER_CLIENT_PY=${SDKDIR}/python/simple_grpc_async_infer_client.py
@@ -91,11 +97,14 @@ if [[ "$(< /proc/sys/kernel/osrelease)" == *microsoft* ]]; then
     CC_UNIT_TEST=${SDKDIR}/python/cc_client_test
 else
     MODELDIR=${MODELDIR:=`pwd`/models}
+    CLIENT_PLUGIN_MODELDIR=${CLIENT_PLUGIN_MODELDIR:=`pwd`/client_plugin_models}
     DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"}
     TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"}
     SERVER=${TRITON_DIR}/bin/tritonserver
     BACKEND_DIR=${TRITON_DIR}/backends
 
+    SIMPLE_AIO_INFER_CLIENT_PY=../clients/simple_grpc_aio_infer_client.py
+    SIMPLE_AIO_STREAM_INFER_CLIENT_PY=../clients/simple_grpc_aio_sequence_stream_infer_client.py
     SIMPLE_HEALTH_CLIENT_PY=../clients/simple_grpc_health_metadata.py
     SIMPLE_INFER_CLIENT_PY=../clients/simple_grpc_infer_client.py
     SIMPLE_ASYNC_INFER_CLIENT_PY=../clients/simple_grpc_async_infer_client.py
@@ -133,6 +142,7 @@ else
     SIMPLE_CUSTOM_ARGS_CLIENT=../clients/simple_grpc_custom_args_client
     CC_UNIT_TEST=../clients/cc_client_test
 fi
+PYTHON_UNIT_TEST=python_unit_test.py
 
 # Add string_dyna_sequence model to repo
 cp -r ${MODELDIR}/simple_dyna_sequence ${MODELDIR}/simple_string_dyna_sequence
@@ -168,6 +178,8 @@ fi
 
 IMAGE=../images/vulture.jpeg
 for i in \
+        $SIMPLE_AIO_INFER_CLIENT_PY \
+        $SIMPLE_AIO_STREAM_INFER_CLIENT_PY \
         $SIMPLE_INFER_CLIENT_PY \
         $SIMPLE_ASYNC_INFER_CLIENT_PY \
         $SIMPLE_STRING_INFER_CLIENT_PY \
@@ -327,6 +339,37 @@ set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
+SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${CLIENT_PLUGIN_MODELDIR} --http-header-forward-pattern=.* --grpc-header-forward-pattern=.*"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python3 $CLIENT_PLUGIN_TEST >> ${CLIENT_LOG}.python.plugin 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.python.plugin
+    RET=1
+fi
+set -e
+
+# Create a password file with username:password
+echo -n 'username:' > pswd
+echo "password" | openssl passwd -stdin -apr1 >> pswd
+nginx -c `pwd`/$NGINX_CONF
+
+set +e
+python3 $BASIC_AUTH_TEST >> ${CLIENT_LOG}.python.plugin.auth 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.python.plugin.auth
+    RET=1
+fi
+set -e
+service nginx stop
+
+kill $SERVER_PID
+wait $SERVER_PID
+
 export GRPC_TRACE=compression, channel
 export GRPC_VERBOSITY=DEBUG
 SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${MODELDIR} --grpc-infer-response-compression-level=high"
@@ -386,6 +429,10 @@ if [ $(cat ${CLIENT_LOG}.model_control | grep "PASS" | wc -l) -ne 1 ]; then
     cat ${CLIENT_LOG}.model_control
     RET=1
 fi
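+# The model control test above is expected to have triggered an
+# "Invalid config override" error in the server log.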
+if [ $(cat ${SERVER_LOG} | grep "Invalid config override" | wc -l) -eq 0 ]; then
+    cat ${SERVER_LOG}
+    RET=1
+fi
 set -e
 
 kill $SERVER_PID
@@ -443,7 +490,7 @@ wait $SERVER_PID
 # Run cpp client unit test
 rm -rf unit_test_models && mkdir unit_test_models
 cp -r $DATADIR/qa_model_repository/onnx_int32_int32_int32 unit_test_models/.
-cp -r ${MODELDIR}/simple unit_test_models/. 
+cp -r ${MODELDIR}/simple unit_test_models/.
 
 SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=unit_test_models
             --trace-file=global_unittest.log --trace-level=TIMESTAMPS --trace-rate=1"
@@ -481,21 +528,138 @@ SERVER_ARGS="--model-repository=`pwd`/unit_test_models \
              --strict-model-config=false"
 SERVER_LOG="./inference_server_cc_unit_test.load.log"
 CLIENT_LOG="./cc_unit_test.load.log"
+
+for i in \
+   "LoadWithFileOverride" \
+   "LoadWithConfigOverride" \
+   ; do
+    run_server
+    if [ "$SERVER_PID" == "0" ]; then
+        echo -e "\n***\n*** Failed to start $SERVER\n***"
+        cat $SERVER_LOG
+        exit 1
+    fi
+
+    set +e
+    $CC_UNIT_TEST --gtest_filter=GRPC*$i >> ${CLIENT_LOG}.$i 2>&1
+    if [ $? -ne 0 ]; then
+        cat ${CLIENT_LOG}.$i
+        RET=1
+    fi
+    set -e
+
+    kill $SERVER_PID
+    wait $SERVER_PID
+done
+
+# Run python grpc aio unit test
+PYTHON_GRPC_AIO_TEST=python_grpc_aio_test.py
+CLIENT_LOG=`pwd`/python_grpc_aio_test.log
+SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${MODELDIR}"
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
     cat $SERVER_LOG
     exit 1
 fi
-
 set +e
-$CC_UNIT_TEST --gtest_filter=GRPC*Load* >> ${CLIENT_LOG} 2>&1
+python $PYTHON_GRPC_AIO_TEST > $CLIENT_LOG 2>&1
 if [ $? -ne 0 ]; then
-    cat ${CLIENT_LOG}
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Python GRPC AsyncIO Test Failed\n***"
     RET=1
 fi
 set -e
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test that the GRPC health check is implemented
+go install github.com/grpc-ecosystem/grpc-health-probe@latest
+HEALTH_PROBE="${GOPATH}/bin/grpc-health-probe -addr=localhost:8001"
 
+CLIENT_LOG=`pwd`/grpc_health_probe_offline.log
+set +e
+$HEALTH_PROBE > $CLIENT_LOG 2>&1
+set -e
+if [ `grep -c "timeout: failed to connect service" ${CLIENT_LOG}` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected health check timeout\n***"
+    cat $CLIENT_LOG
+    RET=1
+fi
+
+SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${MODELDIR}"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+CLIENT_LOG=`pwd`/grpc_health_probe_online.log
+set +e
+$HEALTH_PROBE > $CLIENT_LOG 2>&1
+set -e
+if [ `grep -c "status: SERVING" ${CLIENT_LOG}` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected health check to return SERVING\n***"
+    cat $CLIENT_LOG
+    RET=1
+fi
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Repeated protocol, not allowed
+SERVER_ARGS="--model-repository=${MODELDIR} \
+             --grpc-restricted-protocol=model-repository,health:k1=v1 \
+             --grpc-restricted-protocol=metadata,health:k2=v2"
+run_server
+EXPECTED_MSG="protocol 'health' can not be specified in multiple config groups"
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** Expect fail to start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    RET=1
+elif [ `grep -c "${EXPECTED_MSG}" ${SERVER_LOG}` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected ${EXPECTED_MSG} to be found in log\n***"
+    cat $SERVER_LOG
+    RET=1
+fi
+
+# Unknown protocol, not allowed
+SERVER_ARGS="--model-repository=${MODELDIR} \
+             --grpc-restricted-protocol=model-reposit,health:k1=v1 \
+             --grpc-restricted-protocol=metadata,health:k2=v2"
+run_server
+EXPECTED_MSG="unknown restricted protocol 'model-reposit'"
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** Expect fail to start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    RET=1
+elif [ `grep -c "${EXPECTED_MSG}" ${SERVER_LOG}` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected ${EXPECTED_MSG} to be found in log\n***"
+    cat $SERVER_LOG
+    RET=1
+fi
+
+# Test restricted protocols
+SERVER_ARGS="--model-repository=${MODELDIR} \
+             --grpc-restricted-protocol=model-repository:admin-key=admin-value \
+             --grpc-restricted-protocol=inference,health:infer-key=infer-value"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+set +e
+python $PYTHON_UNIT_TEST RestrictedProtocolTest > $CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Python GRPC Restricted Protocol Test Failed\n***"
+    RET=1
+fi
+set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
@@ -506,3 +670,4 @@ else
 fi
 
 exit $RET
+
diff --git a/qa/L0_grpc_state_cleanup/cleanup_test.py b/qa/L0_grpc_state_cleanup/cleanup_test.py
new file mode 100755
index 0000000000..89af756a8b
--- /dev/null
+++ b/qa/L0_grpc_state_cleanup/cleanup_test.py
@@ -0,0 +1,560 @@
+#!/usr/bin/env python3
+
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import os
+import queue
+import signal
+import time
+import unittest
+from functools import partial
+
+import numpy as np
+import test_util as tu
+import tritonclient.grpc as grpcclient
+from tritonclient.utils import InferenceServerException
+
+
+class UserData:
+    def __init__(self):
+        self._response_queue = queue.Queue()
+
+
+def callback(user_data, result, error):
+    if error:
+        user_data._response_queue.put(error)
+    else:
+        user_data._response_queue.put(result)
+
+
+# These state cleanup tests rely on test.sh
+# to check whether all the created request objects
+# were properly deleted by the server.
+# The purpose of these unit tests is to exercise
+# different portions of the gRPC frontend and
+# track the state objects.
+class CleanUpTest(tu.TestResultCollector):
+    SERVER_PID = None
+
+    def setUp(self):
+        self.decoupled_model_name_ = "repeat_int32"
+        self.identity_model_name_ = "custom_zero_1_float32"
+
+    def _prepare_inputs_and_outputs(self, kind):
+        if kind == "decoupled_streaming":
+            self.inputs_ = []
+            self.inputs_.append(grpcclient.InferInput("IN", [1], "INT32"))
+            self.inputs_.append(grpcclient.InferInput("DELAY", [1], "UINT32"))
+            self.inputs_.append(grpcclient.InferInput("WAIT", [1], "UINT32"))
+
+            self.outputs_ = []
+            self.outputs_.append(grpcclient.InferRequestedOutput("OUT"))
+            self.outputs_.append(grpcclient.InferRequestedOutput("IDX"))
+            self.requested_outputs_ = self.outputs_
+        elif kind == "simple" or kind == "streaming":
+            self.inputs_ = []
+            self.inputs_.append(grpcclient.InferInput("INPUT0", [1, 1], "FP32"))
+
+            self.outputs_ = []
+            self.outputs_.append(grpcclient.InferRequestedOutput("OUTPUT0"))
+            self.requested_outputs_ = self.outputs_
+        else:
+            raise ValueError("Unsupported kind specified to prepare inputs/outputs")
+
+    def _simple_infer(
+        self,
+        request_count,
+        cancel_response_idx=None,
+        client_timeout_pair=None,
+        kill_server=None,
+    ):
+        with grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        ) as triton_client:
+            self._prepare_inputs_and_outputs("simple")
+
+            input_data = np.array([[1.0]], dtype=np.float32)
+            self.inputs_[0].set_data_from_numpy(input_data)
+
+            user_data = UserData()
+
+            futures = []
+            timeout_idx = None
+            timeout_value = None
+            if client_timeout_pair:
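+                # client_timeout_pair = (request index, timeout in seconds);
+                # only that one request is sent with a client-side timeout.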
+                timeout_idx, timeout_value = client_timeout_pair
+            for i in range(request_count):
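+                # Optionally send SIGINT to the server right before issuing the
+                # kill_server-th request to exercise shutdown cleanup.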
+                if kill_server == i:
+                    os.kill(int(self.SERVER_PID), signal.SIGINT)
+                this_timeout = None
+                if timeout_idx == i:
+                    this_timeout = timeout_value
+                futures.append(
+                    triton_client.async_infer(
+                        model_name=self.identity_model_name_,
+                        inputs=self.inputs_,
+                        request_id=str(i),
+                        callback=partial(callback, user_data),
+                        outputs=self.requested_outputs_,
+                        client_timeout=this_timeout,
+                    )
+                )
+
+            if cancel_response_idx is not None:
+                futures[cancel_response_idx].cancel()
+
+            responses = []
+            while len(responses) < len(futures):
+                data_item = user_data._response_queue.get()
+                if type(data_item) == InferenceServerException:
+                    raise data_item
+                else:
+                    responses.append(data_item)
+
+            for response in responses:
+                output0_data = response.as_numpy("OUTPUT0")
+                self.assertTrue(np.array_equal(input_data, output0_data))
+
+    def _stream_infer_with_params(
+        self,
+        request_count,
+        request_delay,
+        _,
+        user_data,
+        result_dict,
+        delay_data=None,
+        delay_factor=None,
+        cancel_response_idx=None,
+        stream_timeout=None,
+        kill_server=None,
+    ):
+        with grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        ) as triton_client:
+            # Establish stream
+            triton_client.start_stream(
+                callback=partial(callback, user_data), stream_timeout=stream_timeout
+            )
+            # Send the specified number of requests in parallel
+            for i in range(request_count):
+                time.sleep((request_delay / 1000))
+                self.inputs_[1].set_data_from_numpy(delay_data)
+                if kill_server == i:
+                    os.kill(int(self.SERVER_PID), signal.SIGINT)
+                triton_client.async_stream_infer(
+                    model_name=self.decoupled_model_name_,
+                    inputs=self.inputs_,
+                    request_id=str(i),
+                    outputs=self.requested_outputs_,
+                    # Opt-in to receiving flags-only responses from model/backend
+                    # to help detect final responses for decoupled models.
+                    enable_empty_final_response=True,
+                )
+                # Update delay input in accordance with the scaling factor
+                delay_data = delay_data * delay_factor
+                delay_data = delay_data.astype(np.uint32)
+
+            # Retrieve results...
+            recv_count = 0
+            completed_requests = 0
+            while completed_requests < request_count:
+                if cancel_response_idx == recv_count:
+                    triton_client.stop_stream(cancel_requests=True)
+                data_item = user_data._response_queue.get()
+                if type(data_item) == InferenceServerException:
+                    raise data_item
+                else:
+                    response = data_item.get_response()
+                    # Request IDs should generally be provided with each request
+                    # to associate decoupled responses with their requests.
+                    if not response.id:
+                        raise ValueError(
+                            "No response id found. Was a request_id provided?"
+                        )
+
+                    # Detect final response. Parameters are oneof and we expect bool_param
+                    if response.parameters.get("triton_final_response").bool_param:
+                        completed_requests += 1
+
+                    # Only process non-empty response, ignore if empty (no outputs)
+                    if response.outputs:
+                        if response.id not in result_dict:
+                            result_dict[response.id] = []
+                        result_dict[response.id].append((recv_count, data_item))
+                        recv_count += 1
+
+    def _stream_infer(
+        self,
+        request_count,
+        request_delay,
+        expected_count,
+        user_data,
+        result_dict,
+        delay_data=None,
+        delay_factor=None,
+        cancel_response_idx=None,
+        stream_timeout=None,
+        kill_server=None,
+    ):
+        with grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        ) as triton_client:
+            # Establish stream
+            triton_client.start_stream(
+                callback=partial(callback, user_data), stream_timeout=stream_timeout
+            )
+            # Send the specified number of requests in parallel
+            for i in range(request_count):
+                time.sleep((request_delay / 1000))
+                model_name = self.identity_model_name_
+                if delay_data is not None:
+                    model_name = self.decoupled_model_name_
+                    self.inputs_[1].set_data_from_numpy(delay_data)
+                if kill_server == i:
+                    os.kill(int(self.SERVER_PID), signal.SIGINT)
+                triton_client.async_stream_infer(
+                    model_name=model_name,
+                    inputs=self.inputs_,
+                    request_id=str(i),
+                    outputs=self.requested_outputs_,
+                )
+                if (delay_data is not None) and (delay_factor is not None):
+                    # Update delay input in accordance with the scaling factor
+                    delay_data = delay_data * delay_factor
+                    delay_data = delay_data.astype(np.uint32)
+
+            # Retrieve results...
+            recv_count = 0
+            while recv_count < expected_count:
+                if cancel_response_idx == recv_count:
+                    triton_client.stop_stream(cancel_requests=True)
+                data_item = user_data._response_queue.get()
+                if type(data_item) == InferenceServerException:
+                    raise data_item
+                else:
+                    this_id = data_item.get_response().id
+                    if this_id not in result_dict:
+                        result_dict[this_id] = []
+                    result_dict[this_id].append((recv_count, data_item))
+
+                recv_count += 1
+
+    def _streaming_infer(
+        self,
+        request_count,
+        request_delay=0,
+        cancel_response_idx=None,
+        stream_timeout=None,
+        kill_server=None,
+        should_error=True,
+    ):
+        self._prepare_inputs_and_outputs("streaming")
+
+        input_data = np.array([[1.0]], dtype=np.float32)
+        self.inputs_[0].set_data_from_numpy(input_data)
+
+        user_data = UserData()
+        result_dict = {}
+
+        try:
+            expected_count = request_count
+            self._stream_infer(
+                request_count,
+                request_delay,
+                expected_count,
+                user_data,
+                result_dict,
+                cancel_response_idx=cancel_response_idx,
+                stream_timeout=stream_timeout,
+                kill_server=kill_server,
+            )
+        except Exception as ex:
+            if cancel_response_idx or stream_timeout or should_error:
+                raise ex
+            self.assertTrue(False, "unexpected error {}".format(ex))
+
+        # Validate the results..
+        for i in range(request_count):
+            this_id = str(i)
+            if this_id not in result_dict.keys():
+                self.assertTrue(
+                    False, "response for request id {} not received".format(this_id)
+                )
+            self.assertEqual(len(result_dict[this_id]), 1)
+            result = result_dict[this_id][0][1]
+            output0_data = result.as_numpy("OUTPUT0")
+            self.assertTrue(np.array_equal(input_data, output0_data))
+
+    def _decoupled_infer(
+        self,
+        request_count,
+        request_delay=0,
+        repeat_count=1,
+        data_offset=100,
+        delay_time=1000,
+        delay_factor=1,
+        wait_time=500,
+        cancel_response_idx=None,
+        stream_timeout=None,
+        kill_server=None,
+        should_error=True,
+        infer_helper_map=[True, True],
+    ):
+        self._prepare_inputs_and_outputs(kind="decoupled_streaming")
+
+        # Initialize data for IN
+        input_data = np.arange(
+            start=data_offset, stop=data_offset + repeat_count, dtype=np.int32
+        )
+        self.inputs_[0].set_shape([repeat_count])
+        self.inputs_[0].set_data_from_numpy(input_data)
+
+        # Initialize data for DELAY
+        delay_data = (np.ones([repeat_count], dtype=np.uint32)) * delay_time
+        self.inputs_[1].set_shape([repeat_count])
+
+        # Initialize data for WAIT
+        wait_data = np.array([wait_time], dtype=np.uint32)
+        self.inputs_[2].set_data_from_numpy(wait_data)
+
+        infer_helpers = []
+        if infer_helper_map[0]:
+            infer_helpers.append(self._stream_infer)
+        if infer_helper_map[1]:
+            infer_helpers.append(self._stream_infer_with_params)
+
+        for infer_helper in infer_helpers:
+            user_data = UserData()
+            result_dict = {}
+
+            try:
+                expected_count = repeat_count * request_count
+                infer_helper(
+                    request_count,
+                    request_delay,
+                    expected_count,
+                    user_data,
+                    result_dict,
+                    delay_data,
+                    delay_factor,
+                    cancel_response_idx,
+                    stream_timeout,
+                    kill_server,
+                )
+            except Exception as ex:
+                if cancel_response_idx or stream_timeout or should_error:
+                    raise ex
+                self.assertTrue(False, "unexpected error {}".format(ex))
+
+            # Validate the results..
+            for i in range(request_count):
+                this_id = str(i)
+                if repeat_count != 0 and this_id not in result_dict.keys():
+                    self.assertTrue(
+                        False, "response for request id {} not received".format(this_id)
+                    )
+                elif repeat_count == 0 and this_id in result_dict.keys():
+                    self.assertTrue(
+                        False,
+                        "received unexpected response for request id {}".format(
+                            this_id
+                        ),
+                    )
+                if repeat_count != 0:
+                    self.assertEqual(len(result_dict[this_id]), repeat_count)
+                    expected_data = data_offset
+                    result_list = result_dict[this_id]
+                    for j in range(len(result_list)):
+                        this_data = result_list[j][1].as_numpy("OUT")
+                        self.assertEqual(len(this_data), 1)
+                        self.assertEqual(this_data[0], expected_data)
+                        this_idx = result_list[j][1].as_numpy("IDX")
+                        self.assertEqual(len(this_idx), 1)
+                        self.assertEqual(this_idx[0], j)
+                        expected_data += 1
+
+    ###
+    ### Non-Streaming Tests
+    ###
+    def test_simple_infer(self):
+        # This test case sends 10 asynchronous requests and validates
+        # the response.
+        self._simple_infer(request_count=10)
+
+    def test_simple_infer_cancellation(self):
+        # This test case is used to check whether all the states are
+        # correctly released when one of the requests is cancelled from
+        # the client side.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._simple_infer(request_count=10, cancel_response_idx=5)
+        self.assertIn("Locally cancelled by application!", str(cm.exception))
+
+    def test_simple_infer_timeout(self):
+        # This test case is used to check whether all the states are
+        # correctly released when a request times out on the client.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._simple_infer(request_count=10, client_timeout_pair=[5, 0.1])
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+
+    def test_simple_infer_error_status(self):
+        # This test case is used to check whether all the state objects are
+        # released when RPC runs into error.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._simple_infer(request_count=10)
+        self.assertIn(
+            "This protocol is restricted, expecting header 'triton-grpc-protocol-infer-key'",
+            str(cm.exception),
+        )
+
+    def test_simple_infer_shutdownserver(self):
+        # This test case is used to check whether all the state objects are
+        # released when the server is interrupted and shut down in the middle
+        # of an inference run, with final parameters being returned.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._simple_infer(request_count=10, kill_server=5)
+
+    ###
+    ### Streaming Tests
+    ###
+    def test_streaming_infer(self):
+        # Sanity test to check whether all the state objects
+        # are correctly released. Sends 10 requests in a single
+        # gRPC bidirectional stream.
+        self._streaming_infer(request_count=10)
+
+    def test_streaming_cancellation(self):
+        # This test case is used to check whether all the states are
+        # correctly released when the stream is closed after the fifth
+        # response is received.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._streaming_infer(request_count=10, cancel_response_idx=5)
+        self.assertIn("Locally cancelled by application!", str(cm.exception))
+
+    def test_streaming_timeout(self):
+        # This test case checks whether all the states are released
+        # when some of the requests time out.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._streaming_infer(request_count=10, request_delay=1, stream_timeout=2)
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+
+    def test_streaming_error_status(self):
+        # This test case checks whether all the state objects are
+        # released when the RPC runs into an error.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._streaming_infer(request_count=10, should_error=True)
+        self.assertIn(
+            "This protocol is restricted, expecting header 'triton-grpc-protocol-infer-key'",
+            str(cm.exception),
+        )
+
+    def test_streaming_infer_shutdownserver(self):
+        # This test case checks whether all the state objects are
+        # released when the server is shut down in the middle of an
+        # inference run.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._streaming_infer(
+                request_count=10,
+                request_delay=1,
+                kill_server=5,
+                should_error=True,
+            )
+
+    ###
+    ### Decoupled Streaming Tests
+    ###
+    def test_decoupled_infer(self):
+        # Sanity test to check whether all the state objects
+        # are correctly released. Sends 10 requests in a single
+        # gRPC bidirectional stream and expects each of these
+        # requests to generate 10 responses.
+        self._decoupled_infer(request_count=10, repeat_count=10)
+
+    def test_decoupled_cancellation(self):
+        # This test case checks whether all the states are correctly
+        # released when the stream is closed after the fifth response
+        # is received.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._decoupled_infer(
+                request_count=10, repeat_count=10, cancel_response_idx=5
+            )
+        self.assertIn("Locally cancelled by application!", str(cm.exception))
+
+    def test_decoupled_timeout(self):
+        # This test case checks whether all the states are released
+        # when some of the requests time out.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._decoupled_infer(
+                request_count=10, repeat_count=10, request_delay=1, stream_timeout=2
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+
+    def test_decoupled_error_status(self):
+        # This test case checks whether all the state objects are
+        # released when the RPC runs into an error.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._decoupled_infer(request_count=10, repeat_count=10, should_error=True)
+        self.assertIn(
+            "This protocol is restricted, expecting header 'triton-grpc-protocol-infer-key'",
+            str(cm.exception),
+        )
+
+    def test_decoupled_infer_shutdownserver(self):
+        # This test case checks whether all the state objects are
+        # released when the server is shut down in the middle of an
+        # inference run.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._decoupled_infer(
+                request_count=10,
+                repeat_count=10,
+                request_delay=1,
+                kill_server=5,
+                should_error=True,
+                infer_helper_map=[True, False],
+            )
+
+    def test_decoupled_infer_with_params_shutdownserver(self):
+        # This test case checks whether all the state objects are
+        # released when the server is shut down in the middle of an
+        # inference run while the final parameters are being returned.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._decoupled_infer(
+                request_count=10,
+                repeat_count=10,
+                request_delay=1,
+                kill_server=5,
+                should_error=True,
+                infer_helper_map=[False, True],
+            )
+
+
+if __name__ == "__main__":
+    CleanUpTest.SERVER_PID = os.environ.get("SERVER_PID", CleanUpTest.SERVER_PID)
+    unittest.main()
diff --git a/qa/L0_grpc_state_cleanup/test.sh b/qa/L0_grpc_state_cleanup/test.sh
new file mode 100755
index 0000000000..605edb6f9c
--- /dev/null
+++ b/qa/L0_grpc_state_cleanup/test.sh
@@ -0,0 +1,194 @@
+#!/bin/bash
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
+export CUDA_VISIBLE_DEVICES=0
+
+RET=0
+CLEANUP_TEST=cleanup_test.py
+
+rm -f *.log
+
+CLIENT_LOG=`pwd`/client.log
+SERVER=/opt/tritonserver/bin/tritonserver
+source ../common/util.sh
+
+function check_state_release() {
+  local log_file=$1
+
+  num_state_release=`cat $log_file | grep  "StateRelease" | wc -l`
+  num_state_new=`cat $log_file | grep  "StateNew" | wc -l`
+
+  if [ $num_state_release -ne $num_state_new ]; then
+    cat $log_file
+    echo -e "\n***\n*** Test Failed: Mismatch detected, $num_state_new state(s) created, $num_state_release state(s) released. \n***" >> $log_file
+    return 1
+  fi
+
+  return 0
+}
+
+rm -fr ./models/custom_zero_1_float32 && \
+        cp -r ../custom_models/custom_zero_1_float32 ./models/. && \
+        mkdir -p ./models/custom_zero_1_float32/1
+
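+# The execute_delay_ms parameter is expected to make the custom model delay
+# roughly 1 second before responding, keeping requests in flight long enough
+# for the cancellation and timeout tests to exercise the state cleanup paths.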
+(cd models/custom_zero_1_float32 && \
+    echo "parameters [" >> config.pbtxt && \
+    echo "{ key: \"execute_delay_ms\"; value: { string_value: \"1000\" }}" >> config.pbtxt && \
+    echo "]" >> config.pbtxt)
+
+for i in test_simple_infer \
+            test_simple_infer_cancellation \
+            test_simple_infer_timeout \
+            test_streaming_infer \
+            test_streaming_timeout \
+            test_streaming_cancellation \
+            test_decoupled_infer \
+            test_decoupled_cancellation \
+            test_decoupled_timeout; do
+  SERVER_LOG="./inference_server.$i.log"
+  SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=2"
+  run_server
+  if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+  fi
+
+  echo "Test: $i" >>$CLIENT_LOG
+
+  set +e
+  python $CLEANUP_TEST CleanUpTest.$i >>$CLIENT_LOG 2>&1
+  if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Test $i Failed\n***" >>$CLIENT_LOG
+    echo -e "\n***\n*** Test $i Failed\n***"
+    RET=1
+  fi
+
+  kill $SERVER_PID
+  wait $SERVER_PID
+
+  check_state_release $SERVER_LOG
+  if [ $? -ne 0 ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** State Verification Failed for $i\n***"
+    RET=1
+  fi
+  set -e
+done
+
+
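+# These tests start the server with the gRPC inference protocol restricted so
+# that requests fail with a restricted-protocol error; state should still be
+# released even when the RPC errors out.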
+for i in test_simple_infer_error_status \
+                test_streaming_error_status \
+                test_decoupled_error_status; do
+  SERVER_LOG="./inference_server.$i.log"
+  SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=2 --grpc-restricted-protocol=inference:infer-key=infer-value"
+  run_server
+  if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+  fi
+
+  echo "Test: $i" >>$CLIENT_LOG
+
+  set +e
+  python $CLEANUP_TEST CleanUpTest.$i >>$CLIENT_LOG 2>&1
+  if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Test $i Failed\n***" >>$CLIENT_LOG
+    echo -e "\n***\n*** Test $i Failed\n***"
+    RET=1
+  fi
+
+  kill $SERVER_PID
+  wait $SERVER_PID
+
+  check_state_release $SERVER_LOG
+  if [ $? -ne 0 ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** State Verification Failed for $i\n***"
+    RET=1
+  fi
+
+  set -e
+done
+
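+# The shutdown tests kill the server from inside the test case (via the
+# SERVER_PID environment variable), so only wait for the process here.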
+for i in test_simple_infer_shutdownserver \
+         test_streaming_infer_shutdownserver \
+         test_decoupled_infer_shutdownserver \
+         test_decoupled_infer_with_params_shutdownserver; do
+  SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=2"
+  SERVER_LOG="./inference_server.$i.log"
+  run_server
+  if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+  fi
+
+  echo "Test: $i" >>$CLIENT_LOG
+
+  set +e
+  SERVER_PID=$SERVER_PID python $CLEANUP_TEST CleanUpTest.$i >>$CLIENT_LOG 2>&1
+  if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Test $i Failed\n***" >>$CLIENT_LOG
+    echo -e "\n***\n*** Test $i Failed\n***"
+    RET=1
+  fi
+
+  wait $SERVER_PID
+
+  check_state_release $SERVER_LOG
+  if [ $? -ne 0 ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** State Verification Failed for $i\n***"
+    RET=1
+  fi
+
+  set -e
+done
+
+
+if [ $RET -eq 0 ]; then
+  echo -e "\n***\n*** Test Passed\n***"
+else
+  echo -e "\n***\n*** Test Failed\n***"
+fi
+
+exit $RET
diff --git a/qa/L0_http/generate_endpoint_test.py b/qa/L0_http/generate_endpoint_test.py
new file mode 100755
index 0000000000..29d2e20d96
--- /dev/null
+++ b/qa/L0_http/generate_endpoint_test.py
@@ -0,0 +1,419 @@
+#!/usr/bin/python3
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import json
+import threading
+import time
+import unittest
+
+import requests
+import sseclient
+import test_util as tu
+
+
+class GenerateEndpointTest(tu.TestResultCollector):
+    def setUp(self):
+        self._model_name = "mock_llm"
+
+    def _get_infer_url(self, model_name, route):
+        return f"http://localhost:8000/v2/models/{model_name}/{route}"
+
+    def generate_stream(self, model_name, inputs, stream=False):
+        headers = {"Accept": "text/event-stream"}
+        url = self._get_infer_url(model_name, "generate_stream")
+        # stream=True indicates the response can be iterated over as it
+        # arrives, which should be the common setting for generate_stream.
+        # The correctness test cases use stream=False so that the full
+        # response content can be re-examined after the request completes.
+        return requests.post(
+            url,
+            data=inputs if isinstance(inputs, str) else json.dumps(inputs),
+            headers=headers,
+            stream=stream,
+        )
+
+    def generate(self, model_name, inputs):
+        url = self._get_infer_url(model_name, "generate")
+        return requests.post(
+            url, data=inputs if isinstance(inputs, str) else json.dumps(inputs)
+        )
+
+    def generate_expect_failure(self, model_name, inputs, msg):
+        url = self._get_infer_url(model_name, "generate")
+        r = requests.post(
+            url, data=inputs if isinstance(inputs, str) else json.dumps(inputs)
+        )
+        # Content-Type header should always be JSON for errors
+        self.assertEqual(r.headers["Content-Type"], "application/json")
+
+        try:
+            r.raise_for_status()
+            self.fail(f"Expected failure, success for {inputs}")
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(msg, r.json()["error"])
+
+    def generate_stream_expect_failure(self, model_name, inputs, msg):
+        r = self.generate_stream(model_name, inputs)
+        # Content-Type header should always be JSON for errors
+        self.assertEqual(r.headers["Content-Type"], "application/json")
+
+        try:
+            r.raise_for_status()
+            self.fail(f"Expected failure, success for {inputs}")
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(msg, r.json()["error"])
+
+    def generate_stream_expect_success(
+        self, model_name, inputs, expected_output, rep_count
+    ):
+        r = self.generate_stream(model_name, inputs)
+        r.raise_for_status()
+        self.check_sse_responses(r, [{"TEXT": expected_output}] * rep_count)
+
+    def check_sse_responses(self, res, expected_res):
+        # Validate SSE format
+        self.assertIn("Content-Type", res.headers)
+        self.assertEqual(
+            "text/event-stream; charset=utf-8", res.headers["Content-Type"]
+        )
+
+        # The SSE wire format ("data: ...") is tedious to parse by hand, so use
+        # the sseclient helper library for simplicity.
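+        # Each event on the wire is assumed to look roughly like
+        #   data: {"TEXT": "hello world"}
+        # followed by a blank line that terminates the event.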
+        client = sseclient.SSEClient(res)
+        res_count = 0
+        for event in client.events():
+            # Parse event data, join events into a single response
+            data = json.loads(event.data)
+            for key, value in expected_res[res_count].items():
+                self.assertIn(key, data)
+                self.assertEqual(value, data[key])
+            res_count += 1
+        self.assertEqual(len(expected_res), res_count)
+        # Make sure there is no message in the wrong form
+        for remaining in client._read():
+            self.assertTrue(
+                remaining.startswith(b"data:"),
+                f"SSE response not formed properly, got: {remaining}",
+            )
+            self.assertTrue(
+                remaining.endswith(b"\n\n"),
+                f"SSE response not formed properly, got: {remaining}",
+            )
+
+    def test_generate(self):
+        # Setup text-based input
+        text = "hello world"
+        inputs = {"PROMPT": text, "STREAM": False}
+
+        r = self.generate(self._model_name, inputs)
+        r.raise_for_status()
+
+        self.assertIn("Content-Type", r.headers)
+        self.assertEqual(r.headers["Content-Type"], "application/json")
+
+        data = r.json()
+        self.assertIn("TEXT", data)
+        self.assertEqual(text, data["TEXT"])
+
+    def test_generate_stream(self):
+        # Setup text-based input
+        text = "hello world"
+        rep_count = 3
+        inputs = {"PROMPT": [text], "STREAM": True, "REPETITION": rep_count}
+        self.generate_stream_expect_success(self._model_name, inputs, text, rep_count)
+
+    def test_streaming(self):
+        # Verify that responses are streamed back as soon as they are generated.
+        text = "hello world"
+        rep_count = 3
+        inputs = {"PROMPT": [text], "STREAM": True, "REPETITION": rep_count, "DELAY": 2}
+        past = time.time()
+        res = self.generate_stream(self._model_name, inputs, stream=True)
+        client = sseclient.SSEClient(res)
+        # This test does not focus on event content
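+        # With DELAY=2 each event should arrive roughly two seconds apart; the
+        # 1-3 second window below leaves room for scheduling jitter.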
+        for _ in client.events():
+            now = time.time()
+            self.assertTrue(1 < (now - past) < 3)
+            past = now
+
+    def test_missing_inputs(self):
+        missing_all_inputs = [
+            # Missing all inputs
+            {},
+            {"abc": 123},
+        ]
+        missing_one_input = [
+            # Missing 1 input
+            {"PROMPT": "hello"},
+            {"STREAM": False},
+            {"STREAM": False, "other": "param"},
+        ]
+        for inputs in missing_all_inputs:
+            self.generate_expect_failure(
+                self._model_name, inputs, "expected 2 inputs but got 0"
+            )
+            self.generate_stream_expect_failure(
+                self._model_name, inputs, "expected 2 inputs but got 0"
+            )
+
+        for inputs in missing_one_input:
+            self.generate_expect_failure(
+                self._model_name, inputs, "expected 2 inputs but got 1"
+            )
+            self.generate_stream_expect_failure(
+                self._model_name, inputs, "expected 2 inputs but got 1"
+            )
+
+    def test_invalid_input_types(self):
+        invalid_bool = "attempt to access JSON non-boolean as boolean"
+        invalid_string = "attempt to access JSON non-string as string"
+        invalid_type_inputs = [
+            # Prompt bad type
+            ({"PROMPT": 123, "STREAM": False}, invalid_string),
+            # Stream bad type
+            ({"PROMPT": "hello", "STREAM": "false"}, invalid_bool),
+            # Both bad type, parsed in order
+            ({"PROMPT": True, "STREAM": 123}, invalid_string),
+            ({"STREAM": 123, "PROMPT": True}, invalid_bool),
+        ]
+
+        for inputs, error_msg in invalid_type_inputs:
+            self.generate_expect_failure(self._model_name, inputs, error_msg)
+            self.generate_stream_expect_failure(self._model_name, inputs, error_msg)
+
+    def test_duplicate_inputs(self):
+        dupe_prompt = "input 'PROMPT' already exists in request"
+        dupe_stream = "input 'STREAM' already exists in request"
+        # Use a JSON string directly since a Python dict can't hold duplicate keys
+        invalid_type_inputs = [
+            # One duplicate
+            (
+                '{"PROMPT": "hello", "STREAM": false, "PROMPT": "duplicate"}',
+                dupe_prompt,
+            ),
+            ('{"PROMPT": "hello", "STREAM": false, "STREAM": false}', dupe_stream),
+            # Multiple duplicates, parsed in order
+            (
+                '{"PROMPT": "hello", "STREAM": false, "PROMPT": "duplicate", "STREAM": true}',
+                dupe_prompt,
+            ),
+            (
+                '{"PROMPT": "hello", "STREAM": false, "STREAM": true, "PROMPT": "duplicate"}',
+                dupe_stream,
+            ),
+        ]
+        for inputs, error_msg in invalid_type_inputs:
+            self.generate_expect_failure(self._model_name, inputs, error_msg)
+            self.generate_stream_expect_failure(self._model_name, inputs, error_msg)
+
+    def test_generate_stream_response_error(self):
+        # Setup text-based input
+        text = "hello world"
+        inputs = {"PROMPT": [text], "STREAM": True, "REPETITION": 0, "FAIL_LAST": True}
+        r = self.generate_stream(self._model_name, inputs)
+
+        # With "REPETITION": 0, the error is the first response, so the HTTP
+        # status code is set accordingly.
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.check_sse_responses(r, [{"error": "An Error Occurred"}])
+
+        # With "REPETITION" > 0, the first response is a valid response and sets
+        # the HTTP status to success, so the user must validate each response.
+        inputs["REPETITION"] = 1
+        r = self.generate_stream(self._model_name, inputs)
+        r.raise_for_status()
+
+        self.check_sse_responses(r, [{"TEXT": text}, {"error": "An Error Occurred"}])
+
+    def test_race_condition(self):
+        # In the Triton HTTP frontend, the HTTP response is sent on a different
+        # thread than the Triton response-complete thread, and both threads
+        # share access to the same object. This test sends sufficient load to
+        # the endpoint in an attempt to expose a race condition, if any exists.
+        input1 = {"PROMPT": "hello", "STREAM": False, "param": "segfault"}
+        input2 = {
+            "PROMPT": "hello",
+            "STREAM": True,
+            "REPETITION": 3,
+            "param": "segfault",
+        }
+        threads = []
+
+        def thread_func(model_name, inputs):
+            self.generate_stream(model_name, inputs).raise_for_status()
+
+        for _ in range(50):
+            threads.append(
+                threading.Thread(target=thread_func, args=((self._model_name, input1)))
+            )
+            threads.append(
+                threading.Thread(target=thread_func, args=((self._model_name, input2)))
+            )
+        for thread in threads:
+            thread.start()
+        for thread in threads:
+            thread.join()
+
+    def test_one_response(self):
+        # With the current 'inputs' the model sends at least one response, and
+        # "STREAM" controls how the model delivers its responses:
+        # If True, the model sends two responses: the actual infer response and
+        # a second one carrying only the flag that signals the end of responses.
+        # The 'generate_stream' endpoint is designed for this case, so it should
+        # send the infer response and complete the HTTP response appropriately.
+        # The 'generate' endpoint can also handle this case because, at its
+        # core, only one infer response is received, the same as typical HTTP
+        # usage.
+        # If False, the model sends one response containing both the infer
+        # response and the end flag, the same as a non-decoupled model.
+        inputs = {"PROMPT": "hello world", "STREAM": True}
+        r = self.generate_stream(self._model_name, inputs)
+        r.raise_for_status()
+        r = self.generate(self._model_name, inputs)
+        r.raise_for_status()
+
+        inputs["STREAM"] = False
+        r = self.generate_stream(self._model_name, inputs)
+        r.raise_for_status()
+        r = self.generate(self._model_name, inputs)
+        r.raise_for_status()
+
+    def test_zero_response(self):
+        inputs = {"PROMPT": "hello world", "STREAM": True, "REPETITION": 0}
+        r = self.generate_stream(self._model_name, inputs)
+        r.raise_for_status()
+        # Expect generate fails the inference
+        r = self.generate(self._model_name, inputs)
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(
+                "generate expects model to produce exactly 1 response",
+                r.json()["error"],
+            )
+
+    def test_many_response(self):
+        inputs = {"PROMPT": "hello world", "STREAM": True, "REPETITION": 2}
+        r = self.generate_stream(self._model_name, inputs)
+        r.raise_for_status()
+        # Expect generate fails the inference
+        r = self.generate(self._model_name, inputs)
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(
+                "generate expects model to produce exactly 1 response",
+                r.json()["error"],
+            )
+
+    def test_complex_schema(self):
+        # Currently only fundamental type conversion is supported; a nested
+        # object in the request will result in a parsing error.
+
+        # complex object to parameters (specifying non model input)
+        inputs = {
+            "PROMPT": "hello world",
+            "STREAM": True,
+            "PARAMS": {"PARAM_0": 0, "PARAM_1": True},
+        }
+        r = self.generate(self._model_name, inputs)
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.assertIn("parameter 'PARAMS' has invalid type", r.json()["error"])
+
+        # complex object to model input
+        inputs = {
+            "PROMPT": {"USER": "hello world", "BOT": "world hello"},
+            "STREAM": True,
+        }
+        r = self.generate(self._model_name, inputs)
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(
+                "attempt to access JSON non-string as string", r.json()["error"]
+            )
+
+    def test_close_connection_during_streaming(self):
+        # Set up a streaming request whose responses are generated slowly so
+        # the connection can be closed while responses are still in flight.
+        text = "hello world"
+        rep_count = 3
+        inputs = {"PROMPT": [text], "STREAM": True, "REPETITION": rep_count, "DELAY": 2}
+        res = self.generate_stream(self._model_name, inputs, stream=True)
+        # close connection while the responses are being generated
+        res.close()
+        # check that the server is still healthy
+        health_url = "http://localhost:8000/v2/health/live"
+        requests.get(health_url).raise_for_status()
+
+    def test_parameters(self):
+        # Test reserved nested object for parameters
+        text = "hello world"
+        rep_count = 3
+        inputs = {
+            "PROMPT": [text],
+            "STREAM": True,
+            "parameters": {"REPETITION": rep_count},
+        }
+        self.generate_stream_expect_success(self._model_name, inputs, text, rep_count)
+
+        # parameters keyword is not an object
+        inputs = {"PROMPT": [text], "STREAM": True, "parameters": 1}
+
+        r = self.generate(self._model_name, inputs)
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(
+                "Expected JSON object for keyword: 'parameters'", r.json()["error"]
+            )
+
+        # parameters contains complex object
+        inputs = {
+            "PROMPT": [text],
+            "STREAM": True,
+            "parameters": {"nested": {"twice": 1}},
+        }
+
+        r = self.generate(self._model_name, inputs)
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(
+                "Converting keyword: 'parameters': parameter 'nested' has invalid type.",
+                r.json()["error"],
+            )
+
+
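+# A minimal sketch (not exercised by the tests above) of how a caller might
+# consume the generate_stream endpoint outside of unittest, assuming the same
+# mock_llm model and localhost:8000 endpoint used throughout this file. The
+# helper name is hypothetical and purely illustrative.
+def _example_generate_stream_consumer(prompt="hello world", repetition=3):
+    inputs = {"PROMPT": [prompt], "STREAM": True, "REPETITION": repetition}
+    res = requests.post(
+        "http://localhost:8000/v2/models/mock_llm/generate_stream",
+        data=json.dumps(inputs),
+        headers={"Accept": "text/event-stream"},
+        stream=True,
+    )
+    res.raise_for_status()
+    # Each SSE event carries one JSON response from the model.
+    for event in sseclient.SSEClient(res).events():
+        yield json.loads(event.data)["TEXT"]
+
+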
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_http/generate_models/mock_llm/1/model.py b/qa/L0_http/generate_models/mock_llm/1/model.py
new file mode 100644
index 0000000000..9c5e9423e4
--- /dev/null
+++ b/qa/L0_http/generate_models/mock_llm/1/model.py
@@ -0,0 +1,107 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import json
+import time
+
+import numpy as np
+import triton_python_backend_utils as pb_utils
+
+
+class TritonPythonModel:
+    def initialize(self, args):
+        self.model_config = json.loads(args["model_config"])
+        self.decoupled = self.model_config.get("model_transaction_policy", {}).get(
+            "decoupled"
+        )
+
+    def execute(self, requests):
+        if self.decoupled:
+            return self.exec_decoupled(requests)
+        else:
+            return self.exec(requests)
+
+    def exec(self, requests):
+        responses = []
+        for request in requests:
+            params = json.loads(request.parameters())
+            rep_count = params["REPETITION"] if "REPETITION" in params else 1
+
+            input_np = pb_utils.get_input_tensor_by_name(request, "PROMPT").as_numpy()
+            stream_np = pb_utils.get_input_tensor_by_name(request, "STREAM").as_numpy()
+            stream = stream_np.flatten()[0]
+            if stream:
+                responses.append(
+                    pb_utils.InferenceResponse(
+                        error=pb_utils.TritonError(
+                            "STREAM only supported in decoupled mode"
+                        )
+                    )
+                )
+            else:
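+                # Non-streaming path: return the prompt repeated rep_count
+                # times along the variable-sized output dimension in a single
+                # response.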
+                out_tensor = pb_utils.Tensor(
+                    "TEXT", np.repeat(input_np, rep_count, axis=1)
+                )
+                responses.append(pb_utils.InferenceResponse([out_tensor]))
+        return responses
+
+    def exec_decoupled(self, requests):
+        for request in requests:
+            params = json.loads(request.parameters())
+            rep_count = params["REPETITION"] if "REPETITION" in params else 1
+            fail_last = params["FAIL_LAST"] if "FAIL_LAST" in params else False
+            delay = params["DELAY"] if "DELAY" in params else None
+
+            sender = request.get_response_sender()
+            input_np = pb_utils.get_input_tensor_by_name(request, "PROMPT").as_numpy()
+            stream_np = pb_utils.get_input_tensor_by_name(request, "STREAM").as_numpy()
+            out_tensor = pb_utils.Tensor("TEXT", input_np)
+            response = pb_utils.InferenceResponse([out_tensor])
+            # If stream enabled, just send multiple copies of response
+            # FIXME: Could split up response string into tokens, but this is simpler for now.
+            stream = stream_np.flatten()[0]
+            if stream:
+                for _ in range(rep_count):
+                    if delay is not None:
+                        time.sleep(delay)
+                    if not sender.is_cancelled():
+                        sender.send(response)
+                    else:
+                        break
+                sender.send(
+                    None
+                    if not fail_last
+                    else pb_utils.InferenceResponse(
+                        error=pb_utils.TritonError("An Error Occurred")
+                    ),
+                    flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL,
+                )
+            # If stream disabled, just send one response
+            else:
+                sender.send(
+                    response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL
+                )
+        return None
diff --git a/qa/L0_http/generate_models/mock_llm/config.pbtxt b/qa/L0_http/generate_models/mock_llm/config.pbtxt
new file mode 100644
index 0000000000..6871661525
--- /dev/null
+++ b/qa/L0_http/generate_models/mock_llm/config.pbtxt
@@ -0,0 +1,60 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+backend: "python"
+
+max_batch_size: 0
+
+model_transaction_policy {
+  decoupled: True
+}
+
+input [
+  {
+    name: "PROMPT"
+    data_type: TYPE_STRING
+    dims: [ 1, 1 ]
+  },
+  {
+    name: "STREAM"
+    data_type: TYPE_BOOL
+    dims: [ 1, 1 ]
+  }
+]
+
+output [
+  {
+    name: "TEXT"
+    data_type: TYPE_STRING
+    dims: [ 1, -1 ]
+  }
+]
+
+instance_group [
+  {
+    count: 1
+    kind: KIND_MODEL
+  }
+]
diff --git a/qa/L0_http/http_basic_auth_test.py b/qa/L0_http/http_basic_auth_test.py
new file mode 100755
index 0000000000..5aa1f71d81
--- /dev/null
+++ b/qa/L0_http/http_basic_auth_test.py
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+import sys
+import unittest
+
+sys.path.append("../common")
+
+import test_util as tu
+import tritonclient.http as tritonhttpclient
+import tritonclient.http.aio as asynctritonhttpclient
+from tritonclient.http.aio.auth import BasicAuth as AsyncBasicAuth
+from tritonclient.http.auth import BasicAuth
+
+
+class HTTPBasicAuthTest(tu.TestResultCollector):
+    def setUp(self):
+        # Use the nginx port
+        self._client = tritonhttpclient.InferenceServerClient(url="localhost:8004")
+        self._client.register_plugin(BasicAuth("username", "password"))
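+        # BasicAuth is expected to attach an HTTP basic 'Authorization' header
+        # built from these credentials to every request; the nginx proxy on
+        # port 8004 (see nginx.conf) is assumed to require it.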
+
+    def test_client_call(self):
+        self.assertTrue(self._client.is_server_live())
+
+    def tearDown(self):
+        self._client.close()
+
+
+class HTTPBasicAuthAsyncTest(unittest.IsolatedAsyncioTestCase):
+    async def asyncSetUp(self):
+        # Use the nginx port
+        self._client = asynctritonhttpclient.InferenceServerClient(url="localhost:8004")
+        self._client.register_plugin(AsyncBasicAuth("username", "password"))
+
+    async def test_client_call(self):
+        self.assertTrue(await self._client.is_server_live())
+
+    async def asyncTearDown(self):
+        await self._client.close()
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_http/http_client_plugin_test.py b/qa/L0_http/http_client_plugin_test.py
new file mode 100755
index 0000000000..963ea2a81b
--- /dev/null
+++ b/qa/L0_http/http_client_plugin_test.py
@@ -0,0 +1,175 @@
+#!/usr/bin/python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import unittest
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import numpy as np
+import test_util as tu
+import tritonclient.http as tritonhttpclient
+import tritonclient.http.aio as asynctritonhttpclient
+from tritonclient.http import InferenceServerClientPlugin
+from tritonclient.utils import np_to_triton_dtype
+
+
+# A simple plugin that adds headers to the inference request.
+class TestPlugin(InferenceServerClientPlugin):
+    def __init__(self, headers):
+        self._headers = headers
+
+    def __call__(self, request):
+        request.headers.update(self._headers)
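+        # The client invokes this hook for every request before it is sent, so
+        # the injected headers should show up on each call asserted below.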
+
+
+class HTTPClientPluginAsyncTest(unittest.IsolatedAsyncioTestCase):
+    async def asyncSetUp(self):
+        self._headers = {"MY-KEY": "MY-VALUE"}
+        self._plugin = TestPlugin(self._headers)
+        self._client = asynctritonhttpclient.InferenceServerClient(url="localhost:8001")
+
+    async def test_server_is_live(self):
+        # We are testing is_server_live as an example API that uses the GET
+        # method to communicate with the server.
+        self._client._stub.get = AsyncMock()
+
+        self._client.register_plugin(self._plugin)
+        self.assertEqual(self._plugin, self._client.plugin())
+        await self._client.is_server_live()
+        self._client._stub.get.assert_awaited_with(
+            url=unittest.mock.ANY, headers=self._headers
+        )
+
+        # Make sure unregistering the plugin would no longer add the headers
+        self._client.unregister_plugin()
+        self.assertEqual(None, self._client.plugin())
+        await self._client.is_server_live()
+        self._client._stub.get.assert_awaited_with(url=unittest.mock.ANY, headers={})
+
+    async def test_simple_infer(self):
+        # Only the read function is awaited, so only it needs to be an AsyncMock.
+        post_return = MagicMock()
+        post_return.read = AsyncMock()
+        self._client._stub.post = AsyncMock(return_value=post_return)
+
+        np_input = np.arange(8, dtype=np.float32).reshape(1, -1)
+        model = "onnx_zero_1_float32"
+
+        # Setup inputs
+        inputs = []
+        inputs.append(
+            tritonhttpclient.InferInput(
+                "INPUT0", np_input.shape, np_to_triton_dtype(np_input.dtype)
+            )
+        )
+
+        # Set binary_data to False so that 'Inference-Header-Content-Length' is
+        # not added to the headers.
+        inputs[0].set_data_from_numpy(np_input, binary_data=False)
+
+        async def run_infer(headers):
+            with patch("tritonclient.http.aio._raise_if_error"):
+                with patch("tritonclient.http.aio.InferResult"):
+                    await self._client.infer(model_name=model, inputs=inputs)
+                    self._client._stub.post.assert_awaited_with(
+                        url=unittest.mock.ANY, data=unittest.mock.ANY, headers=headers
+                    )
+
+        self._client.register_plugin(self._plugin)
+        await run_infer(self._headers)
+
+        self._client.unregister_plugin()
+        await run_infer({})
+
+    async def asyncTearDown(self):
+        await self._client.close()
+
+
+class HTTPClientPluginTest(tu.TestResultCollector):
+    def setUp(self):
+        self._headers = {"MY-KEY": "MY-VALUE"}
+        self._plugin = TestPlugin(self._headers)
+        self._client = tritonhttpclient.InferenceServerClient(url="localhost:8001")
+
+        # Use magic mock for the client stub
+        self._client._client_stub = MagicMock()
+
+    def test_server_is_live(self):
+        # We are testing is_server_live as an example API that uses the GET
+        # method to communicate with the server.
+        self._client.register_plugin(self._plugin)
+        self._client.is_server_live()
+        self._client._client_stub.get.assert_called_with(
+            unittest.mock.ANY, headers=self._headers
+        )
+
+        # Make sure unregistering the plugin would no longer add the headers
+        self._client.unregister_plugin()
+        self._client.is_server_live()
+        self._client._client_stub.get.assert_called_with(unittest.mock.ANY, headers={})
+
+    def test_simple_infer(self):
+        np_input = np.arange(8, dtype=np.float32).reshape(1, -1)
+        model = "onnx_zero_1_float32"
+
+        # Setup inputs
+        inputs = []
+        inputs.append(
+            tritonhttpclient.InferInput(
+                "INPUT0", np_input.shape, np_to_triton_dtype(np_input.dtype)
+            )
+        )
+
+        # Set binary_data to False so that 'Inference-Header-Content-Length' is
+        # not added to the headers.
+        inputs[0].set_data_from_numpy(np_input, binary_data=False)
+
+        def run_infer(headers):
+            with patch("tritonclient.http._client._raise_if_error"):
+                with patch("tritonclient.http._client.InferResult"):
+                    self._client.infer(model_name=model, inputs=inputs)
+                    self._client._client_stub.post.assert_called_with(
+                        request_uri=unittest.mock.ANY,
+                        body=unittest.mock.ANY,
+                        headers=headers,
+                    )
+
+        self._client.register_plugin(self._plugin)
+        run_infer(self._headers)
+
+        self._client.unregister_plugin()
+        run_infer({})
+
+    def tearDown(self):
+        self._client.close()
+
+
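+# A minimal sketch (not run by the tests above) of how the plugin could be used
+# against a live server, assuming a Triton HTTP endpoint at localhost:8001 (the
+# URL used by the mocked tests above). The helper name is hypothetical.
+def _example_plugin_usage():
+    client = tritonhttpclient.InferenceServerClient(url="localhost:8001")
+    client.register_plugin(TestPlugin({"MY-KEY": "MY-VALUE"}))
+    # Every request made through this client now carries the extra header.
+    live = client.is_server_live()
+    client.unregister_plugin()
+    client.close()
+    return live
+
+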
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_http/http_restricted_api_test.py b/qa/L0_http/http_restricted_api_test.py
new file mode 100755
index 0000000000..e5e3d5fd2d
--- /dev/null
+++ b/qa/L0_http/http_restricted_api_test.py
@@ -0,0 +1,94 @@
+#!/usr/bin/python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import unittest
+
+import numpy as np
+import tritonclient.http as tritonhttpclient
+from tritonclient.utils import InferenceServerException
+
+
+class RestrictedAPITest(unittest.TestCase):
+    def setUp(self):
+        self.model_name_ = "simple"
+        self.client_ = tritonhttpclient.InferenceServerClient("localhost:8000")
+
+    # Other unspecified APIs should not be restricted
+    def test_sanity(self):
+        self.client_.get_inference_statistics("simple")
+        self.client_.get_inference_statistics(
+            "simple", headers={"infer-key": "infer-value"}
+        )
+
+    # The metadata, infer, and model repository APIs are restricted.
+    # Metadata and infer expect an "infer-key : infer-value" header,
+    # model repository expects "admin-key : admin-value".
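+    # The restriction policy itself is assumed to be configured on the
+    # tritonserver command line by test.sh; these tests only verify the
+    # client-facing behavior.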
+    def test_model_repository(self):
+        with self.assertRaisesRegex(InferenceServerException, "This API is restricted"):
+            self.client_.unload_model(
+                self.model_name_, headers={"infer-key": "infer-value"}
+            )
+        # Request goes through and hits the actual transaction error
+        with self.assertRaisesRegex(
+            InferenceServerException, "explicit model load / unload is not allowed"
+        ):
+            self.client_.unload_model(
+                self.model_name_, headers={"admin-key": "admin-value"}
+            )
+
+    def test_metadata(self):
+        with self.assertRaisesRegex(InferenceServerException, "This API is restricted"):
+            self.client_.get_server_metadata()
+        self.client_.get_server_metadata({"infer-key": "infer-value"})
+
+    def test_infer(self):
+        # setup
+        inputs = [
+            tritonhttpclient.InferInput("INPUT0", [1, 16], "INT32"),
+            tritonhttpclient.InferInput("INPUT1", [1, 16], "INT32"),
+        ]
+        inputs[0].set_data_from_numpy(np.ones(shape=(1, 16), dtype=np.int32))
+        inputs[1].set_data_from_numpy(np.ones(shape=(1, 16), dtype=np.int32))
+
+        # This test only cares whether the request goes through
+        with self.assertRaisesRegex(InferenceServerException, "This API is restricted"):
+            _ = self.client_.infer(
+                model_name=self.model_name_, inputs=inputs, headers={"test": "1"}
+            )
+        self.client_.infer(
+            model_name=self.model_name_,
+            inputs=inputs,
+            headers={"infer-key": "infer-value"},
+        )
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_http/http_test.py b/qa/L0_http/http_test.py
old mode 100644
new mode 100755
index 2a5e3c141e..1f292ffb88
--- a/qa/L0_http/http_test.py
+++ b/qa/L0_http/http_test.py
@@ -1,5 +1,5 @@
 #!/usr/bin/python
-# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -29,40 +29,39 @@
 
 sys.path.append("../common")
 
-import requests
 import unittest
+
 import numpy as np
+import requests
 import test_util as tu
 import tritonclient.http as tritonhttpclient
-from tritonclient.utils import np_to_triton_dtype, InferenceServerException
+from tritonclient.utils import InferenceServerException, np_to_triton_dtype
 
 
 class HttpTest(tu.TestResultCollector):
-
     def _get_infer_url(self, model_name):
         return "http://localhost:8000/v2/models/{}/infer".format(model_name)
 
-    def _raw_binary_helper(self,
-                           model,
-                           input_bytes,
-                           expected_output_bytes,
-                           extra_headers={}):
+    def _raw_binary_helper(
+        self, model, input_bytes, expected_output_bytes, extra_headers={}
+    ):
         # Select model that satisfies constraints for raw binary request
-        headers = {'Inference-Header-Content-Length': '0'}
+        headers = {"Inference-Header-Content-Length": "0"}
         # Add extra headers (if any) before sending request
         headers.update(extra_headers)
-        r = requests.post(self._get_infer_url(model),
-                          data=input_bytes,
-                          headers=headers)
+        r = requests.post(self._get_infer_url(model), data=input_bytes, headers=headers)
         r.raise_for_status()
 
         # Get the inference header size so we can locate the output binary data
         header_size = int(r.headers["Inference-Header-Content-Length"])
         # Assert input == output since this tests an identity model
         self.assertEqual(
-            expected_output_bytes, r.content[header_size:],
-            "Expected response body contains correct output binary data: {}; got: {}"
-            .format(expected_output_bytes, r.content[header_size:]))
+            expected_output_bytes,
+            r.content[header_size:],
+            "Expected response body contains correct output binary data: {}; got: {}".format(
+                expected_output_bytes, r.content[header_size:]
+            ),
+        )
 
     def test_raw_binary(self):
         model = "onnx_zero_1_float32"
@@ -80,54 +79,61 @@ def test_byte(self):
         # i.e. BYTE type the element count must be 1
         model = "onnx_zero_1_object_1_element"
         input = "427"
-        headers = {'Inference-Header-Content-Length': '0'}
-        r = requests.post(self._get_infer_url(model),
-                          data=input,
-                          headers=headers)
+        headers = {"Inference-Header-Content-Length": "0"}
+        r = requests.post(self._get_infer_url(model), data=input, headers=headers)
         r.raise_for_status()
 
         # Get the inference header size so we can locate the output binary data
         header_size = int(r.headers["Inference-Header-Content-Length"])
         # Triton returns BYTES tensor with byte size prepended
-        output = r.content[header_size + 4:].decode()
+        output = r.content[header_size + 4 :].decode()
         self.assertEqual(
-            input, output,
-            "Expected response body contains correct output binary data: {}; got: {}"
-            .format(input, output))
+            input,
+            output,
+            "Expected response body contains correct output binary data: {}; got: {}".format(
+                input, output
+            ),
+        )
 
     def test_byte_too_many_elements(self):
         # Select model that doesn't satisfy constraints for raw binary request
         # i.e. BYTE type the element count must be 1
         model = "onnx_zero_1_object"
         input = "427"
-        headers = {'Inference-Header-Content-Length': '0'}
-        r = requests.post(self._get_infer_url(model),
-                          data=input,
-                          headers=headers)
+        headers = {"Inference-Header-Content-Length": "0"}
+        r = requests.post(self._get_infer_url(model), data=input, headers=headers)
         self.assertEqual(
-            400, r.status_code,
+            400,
+            r.status_code,
             "Expected error code {} returned for the request; got: {}".format(
-                400, r.status_code))
+                400, r.status_code
+            ),
+        )
         self.assertIn(
-            "For BYTE datatype raw input, the model must have input shape [1]",
-            r.content.decode())
+            "For BYTE datatype raw input 'INPUT0', the model must have input shape [1]",
+            r.content.decode(),
+        )
 
     def test_multi_variable_dimensions(self):
         # Select model that doesn't satisfy constraints for raw binary request
         # i.e. this model has multiple variable-sized dimensions
         model = "onnx_zero_1_float16"
         input = np.ones([2, 2], dtype=np.float16)
-        headers = {'Inference-Header-Content-Length': '0'}
-        r = requests.post(self._get_infer_url(model),
-                          data=input.tobytes(),
-                          headers=headers)
+        headers = {"Inference-Header-Content-Length": "0"}
+        r = requests.post(
+            self._get_infer_url(model), data=input.tobytes(), headers=headers
+        )
         self.assertEqual(
-            400, r.status_code,
+            400,
+            r.status_code,
             "Expected error code {} returned for the request; got: {}".format(
-                400, r.status_code))
+                400, r.status_code
+            ),
+        )
         self.assertIn(
             "The shape of the raw input 'INPUT0' can not be deduced because there are more than one variable-sized dimension",
-            r.content.decode())
+            r.content.decode(),
+        )
 
     def test_multi_inputs(self):
         # Select model that doesn't satisfy constraints for raw binary request
@@ -136,21 +142,25 @@ def test_multi_inputs(self):
         # Use one numpy array, after tobytes() it can be seen as three inputs
         # each with 8 elements (this ambiguity is why this is not allowed)
         input = np.arange(24, dtype=np.float32)
-        headers = {'Inference-Header-Content-Length': '0'}
-        r = requests.post(self._get_infer_url(model),
-                          data=input.tobytes(),
-                          headers=headers)
+        headers = {"Inference-Header-Content-Length": "0"}
+        r = requests.post(
+            self._get_infer_url(model), data=input.tobytes(), headers=headers
+        )
         self.assertEqual(
-            400, r.status_code,
+            400,
+            r.status_code,
             "Expected error code {} returned for the request; got: {}".format(
-                400, r.status_code))
+                400, r.status_code
+            ),
+        )
         self.assertIn(
             "Raw request must only have 1 input (found 1) to be deduced but got 3 inputs in",
-            r.content.decode())
+            r.content.decode(),
+        )
 
     # This is to test that a properly chunk-encoded request by the caller works,
     # though Triton does not specifically do any special chunk handling outside
-    # of underlying HTTP libaries used
+    # of underlying HTTP libraries used
     # Future Enhancement: Test other encodings as they come up
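+    # For reference on the manual framing below: a chunked body for b"ABCD"
+    # is sent as b"4\r\nABCD\r\n" (hex length, CRLF, data, CRLF) followed by
+    # the terminating b"0\r\n\r\n" chunk.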
     def test_content_encoding_chunked_manually(self):
         # Similar to test_raw_binary but test with extra headers
@@ -165,9 +175,8 @@ def test_content_encoding_chunked_manually(self):
         # Chunk bytes and line separator
         chunk_encoded_input += input_bytes + b"\r\n"
         # Final byte (0) and end message
-        chunk_encoded_input += b'0\r\n\r\n'
-        self._raw_binary_helper(model, chunk_encoded_input, input_bytes,
-                                extra_headers)
+        chunk_encoded_input += b"0\r\n\r\n"
+        self._raw_binary_helper(model, chunk_encoded_input, input_bytes, extra_headers)
 
     # Test that Python client rejects any "Transfer-Encoding" HTTP headers
     # as we don't specially handle encoding requests for the user through
@@ -183,17 +192,19 @@ def test_content_encoding_unsupported_client(self):
                 inputs = []
                 inputs.append(
                     tritonhttpclient.InferInput(
-                        'INPUT0', np_input.shape,
-                        np_to_triton_dtype(np_input.dtype)))
+                        "INPUT0", np_input.shape, np_to_triton_dtype(np_input.dtype)
+                    )
+                )
                 inputs[0].set_data_from_numpy(np_input)
 
-                with tritonhttpclient.InferenceServerClient(
-                        "localhost:8000") as client:
+                with tritonhttpclient.InferenceServerClient("localhost:8000") as client:
                     # Python client is expected to raise an exception to reject
                     # 'content-encoding' HTTP headers.
-                    with self.assertRaisesRegex(InferenceServerException,
-                                                "Unsupported HTTP header"):
+                    with self.assertRaisesRegex(
+                        InferenceServerException, "Unsupported HTTP header"
+                    ):
                         client.infer(model_name=model, inputs=inputs, headers=headers)
 
-if __name__ == '__main__':
+
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_http/nginx.conf b/qa/L0_http/nginx.conf
new file mode 100644
index 0000000000..fb62ca719c
--- /dev/null
+++ b/qa/L0_http/nginx.conf
@@ -0,0 +1,57 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+worker_processes  1;
+
+error_log  /var/log/nginx/error.log;
+
+events {
+    worker_connections  1024;
+}
+
+http {
+    # Configure basic authentication
+    auth_basic "Restricted Content";
+    auth_basic_user_file /opt/tritonserver/qa/L0_http/pswd;
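+    # The pswd file holds htpasswd-style entries (e.g. "username:$apr1$..."),
+    # generated in test.sh via "openssl passwd -apr1".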
+
+    # Define upstream server
+    upstream backend {
+        server localhost:8000;
+    }
+
+    # Define server block for reverse proxy
+    server {
+        listen 8004;
+
+        # Configure location for reverse proxy
+        location / {
+            proxy_pass http://backend;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        }
+    }
+}
diff --git a/qa/L0_http/python_http_aio_test.py b/qa/L0_http/python_http_aio_test.py
new file mode 100755
index 0000000000..bd8d342bb1
--- /dev/null
+++ b/qa/L0_http/python_http_aio_test.py
@@ -0,0 +1,116 @@
+#!/usr/bin/env python
+# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import unittest
+
+import tritonclient.http.aio as httpclient
+from tritonclient.utils import *
+
+
+class TestHttpAioClient(unittest.IsolatedAsyncioTestCase):
+    """Test if aio rpc can reach the server"""
+
+    async def asyncSetUp(self):
+        self._triton_client = httpclient.InferenceServerClient(url="localhost:8000")
+
+    async def asyncTearDown(self):
+        await self._triton_client.close()
+
+    async def test_is_server_live(self):
+        ret = await self._triton_client.is_server_live()
+        self.assertEqual(ret, True)
+
+    async def test_is_server_ready(self):
+        ret = await self._triton_client.is_server_ready()
+        self.assertEqual(ret, True)
+
+    async def test_is_model_ready(self):
+        ret = await self._triton_client.is_model_ready("simple")
+        self.assertEqual(ret, True)
+
+    async def test_get_server_metadata(self):
+        ret = await self._triton_client.get_server_metadata()
+        self.assertEqual(ret["name"], "triton")
+
+    async def test_get_model_metadata(self):
+        ret = await self._triton_client.get_model_metadata("simple")
+        self.assertEqual(ret["name"], "simple")
+
+    async def test_get_model_config(self):
+        ret = await self._triton_client.get_model_config("simple")
+        self.assertEqual(ret["name"], "simple")
+
+    async def test_get_model_repository_index(self):
+        ret = await self._triton_client.get_model_repository_index()
+        self.assertEqual(len(ret), 7)
+
+    async def test_load_model(self):
+        with self.assertRaisesRegex(
+            InferenceServerException,
+            "explicit model load / unload is not allowed if polling is enabled",
+        ):
+            await self._triton_client.load_model("simple")
+
+    async def test_unload_model(self):
+        with self.assertRaisesRegex(
+            InferenceServerException,
+            "explicit model load / unload is not allowed if polling is enabled",
+        ):
+            await self._triton_client.unload_model("simple")

+
+    async def test_get_inference_statistics(self):
+        await self._triton_client.get_inference_statistics()
+
+    async def test_update_trace_settings(self):
+        await self._triton_client.update_trace_settings()
+
+    async def test_get_trace_settings(self):
+        await self._triton_client.get_trace_settings()
+
+    async def test_get_system_shared_memory_status(self):
+        await self._triton_client.get_system_shared_memory_status()
+
+    async def test_register_system_shared_memory(self):
+        with self.assertRaisesRegex(InferenceServerException, ""):
+            await self._triton_client.register_system_shared_memory("", "", 0)
+
+    async def test_unregister_system_shared_memory(self):
+        await self._triton_client.unregister_system_shared_memory()
+
+    async def test_get_cuda_shared_memory_status(self):
+        await self._triton_client.get_cuda_shared_memory_status()
+
+    async def test_register_cuda_shared_memory(self):
+        with self.assertRaisesRegex(InferenceServerException, ""):
+            await self._triton_client.register_cuda_shared_memory("", b"", 0, 0)
+
+    async def test_unregister_cuda_shared_memory(self):
+        await self._triton_client.unregister_cuda_shared_memory()
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_http/test.sh b/qa/L0_http/test.sh
old mode 100644
new mode 100755
index 9b94d1ea2a..7ba219fe15
--- a/qa/L0_http/test.sh
+++ b/qa/L0_http/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -42,6 +42,10 @@ export CUDA_VISIBLE_DEVICES=0
 
 RET=0
 
+CLIENT_PLUGIN_TEST="./http_client_plugin_test.py"
+BASIC_AUTH_TEST="./http_basic_auth_test.py"
+RESTRICTED_API_TEST="./http_restricted_api_test.py"
+NGINX_CONF="./nginx.conf"
 # On windows the paths invoked by the script (running in WSL) must use
 # /mnt/c when needed but the paths on the tritonserver command-line
 # must be C:/ style.
@@ -52,6 +56,7 @@ if [[ "$(< /proc/sys/kernel/osrelease)" == *microsoft* ]]; then
     BACKEND_DIR=${BACKEND_DIR:=C:/tritonserver/backends}
     SERVER=${SERVER:=/mnt/c/tritonserver/bin/tritonserver.exe}
 
+    SIMPLE_AIO_INFER_CLIENT_PY=${SDKDIR}/python/simple_http_aio_infer_client.py
     SIMPLE_HEALTH_CLIENT_PY=${SDKDIR}/python/simple_http_health_metadata.py
     SIMPLE_INFER_CLIENT_PY=${SDKDIR}/python/simple_http_infer_client.py
     SIMPLE_ASYNC_INFER_CLIENT_PY=${SDKDIR}/python/simple_http_async_infer_client.py
@@ -83,6 +88,7 @@ else
     SERVER=${TRITON_DIR}/bin/tritonserver
     BACKEND_DIR=${TRITON_DIR}/backends
 
+    SIMPLE_AIO_INFER_CLIENT_PY=../clients/simple_http_aio_infer_client.py
     SIMPLE_HEALTH_CLIENT_PY=../clients/simple_http_health_metadata.py
     SIMPLE_INFER_CLIENT_PY=../clients/simple_http_infer_client.py
     SIMPLE_ASYNC_INFER_CLIENT_PY=../clients/simple_http_async_infer_client.py
@@ -143,6 +149,7 @@ fi
 
 IMAGE=../images/vulture.jpeg
 for i in \
+        $SIMPLE_AIO_INFER_CLIENT_PY \
         $SIMPLE_INFER_CLIENT_PY \
         $SIMPLE_ASYNC_INFER_CLIENT_PY \
         $SIMPLE_IMAGE_CLIENT_PY \
@@ -223,6 +230,13 @@ for i in \
     fi
 done
 
+# Test with json input and output data
+$SIMPLE_STRING_INFER_CLIENT --json-input-data --json-output-data >> ${CLIENT_LOG}.c++.json 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.c++.json
+    RET=1
+fi
+
 # Test while reusing the InferInput and InferRequestedOutput objects
 $SIMPLE_REUSE_INFER_OBJECTS_CLIENT -v >> ${CLIENT_LOG}.c++.reuse 2>&1
 if [ $? -ne 0 ]; then
@@ -230,6 +244,24 @@ if [ $? -ne 0 ]; then
     RET=1
 fi
 
+python $CLIENT_PLUGIN_TEST >> ${CLIENT_LOG}.python.plugin 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.python.plugin
+    RET=1
+fi
+
+# Create a password file with username:password
+echo -n 'username:' > pswd
+echo "password" | openssl passwd -stdin -apr1 >> pswd
+nginx -c `pwd`/$NGINX_CONF
+
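+# The basic auth test exercises Triton through the nginx reverse proxy
+# configured above, which listens on port 8004 (see nginx.conf).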
+python $BASIC_AUTH_TEST >> ${CLIENT_LOG}.python.plugin.auth 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.python.plugin.auth
+    RET=1
+fi
+service nginx stop
+
 # Test with the base path in url.
 $SIMPLE_INFER_CLIENT -u localhost:8000/base_path -v >> ${CLIENT_LOG}.c++.base_path_url 2>&1
 if [ $? -eq 0 ]; then
@@ -268,6 +300,10 @@ if [ $(cat ${CLIENT_LOG}.model_control | grep "PASS" | wc -l) -ne 1 ]; then
     cat ${CLIENT_LOG}.model_control
     RET=1
 fi
+if [ $(cat ${SERVER_LOG} | grep "Invalid config override" | wc -l) -eq 0 ]; then
+    cat ${SERVER_LOG}
+    RET=1
+fi
 
 set -e
 
@@ -469,7 +505,7 @@ wait $SERVER_PID
 # Run cpp client unit test
 rm -rf unit_test_models && mkdir unit_test_models
 cp -r $DATADIR/qa_model_repository/onnx_int32_int32_int32 unit_test_models/.
-cp -r ${MODELDIR}/simple unit_test_models/. 
+cp -r ${MODELDIR}/simple unit_test_models/.
 
 SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=unit_test_models
             --trace-file=global_unittest.log --trace-level=TIMESTAMPS --trace-rate=1"
@@ -507,32 +543,61 @@ SERVER_ARGS="--model-repository=`pwd`/unit_test_models \
              --strict-model-config=false"
 SERVER_LOG="./inference_server_cc_unit_test.load.log"
 CLIENT_LOG="./cc_unit_test.load.log"
+
+for i in \
+   "LoadWithFileOverride" \
+   "LoadWithConfigOverride" \
+   ; do
+    run_server
+    if [ "$SERVER_PID" == "0" ]; then
+        echo -e "\n***\n*** Failed to start $SERVER\n***"
+        cat $SERVER_LOG
+        exit 1
+    fi
+
+    set +e
+    $CC_UNIT_TEST --gtest_filter=HTTP*$i >> ${CLIENT_LOG}.$i 2>&1
+    if [ $? -ne 0 ]; then
+        cat ${CLIENT_LOG}.$i
+        RET=1
+    fi
+    set -e
+
+    kill $SERVER_PID
+    wait $SERVER_PID
+done
+
+# Run python http aio unit test
+PYTHON_HTTP_AIO_TEST=python_http_aio_test.py
+CLIENT_LOG=`pwd`/python_http_aio_test.log
+SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${MODELDIR}"
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
     cat $SERVER_LOG
     exit 1
 fi
-
 set +e
-$CC_UNIT_TEST --gtest_filter=HTTP*Load* >> ${CLIENT_LOG} 2>&1
+python $PYTHON_HTTP_AIO_TEST > $CLIENT_LOG 2>&1
 if [ $? -ne 0 ]; then
-    cat ${CLIENT_LOG}
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Python HTTP AsyncIO Test Failed\n***"
     RET=1
 fi
 set -e
-
 kill $SERVER_PID
 wait $SERVER_PID
 
 # Run python unit test
-rm -r ${MODELDIR}/*
+MODELDIR=python_unit_test_models
+mkdir -p $MODELDIR
+rm -rf ${MODELDIR}/*
 cp -r $DATADIR/qa_identity_model_repository/onnx_zero_1_float32 ${MODELDIR}/.
 cp -r $DATADIR/qa_identity_model_repository/onnx_zero_1_object ${MODELDIR}/.
 cp -r $DATADIR/qa_identity_model_repository/onnx_zero_1_float16 ${MODELDIR}/.
 cp -r $DATADIR/qa_identity_model_repository/onnx_zero_3_float32 ${MODELDIR}/.
 cp -r ${MODELDIR}/onnx_zero_1_object ${MODELDIR}/onnx_zero_1_object_1_element && \
-    (cd models/onnx_zero_1_object_1_element && \
+    (cd $MODELDIR/onnx_zero_1_object_1_element && \
         sed -i "s/onnx_zero_1_object/onnx_zero_1_object_1_element/" config.pbtxt && \
         sed -i "0,/-1/{s/-1/1/}" config.pbtxt)
 
@@ -550,7 +615,45 @@ TEST_RESULT_FILE='test_results.txt'
 PYTHON_TEST=http_test.py
 EXPECTED_NUM_TESTS=8
 set +e
-python3 $PYTHON_TEST >$CLIENT_LOG 2>&1
+python $PYTHON_TEST >$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    RET=1
+else
+    check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Result Verification Failed\n***"
+        RET=1
+    fi
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+### LLM / Generate REST API Endpoint Tests ###
+
+# Helper library to parse SSE events
+# https://github.com/mpetazzoni/sseclient
+pip install sseclient-py
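+# Each SSE event arrives as one or more "data: <payload>" lines terminated by
+# a blank line, which sseclient-py parses into discrete events.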
+
+SERVER_ARGS="--model-repository=`pwd`/generate_models"
+SERVER_LOG="./inference_server_generate_endpoint_test.log"
+CLIENT_LOG="./generate_endpoint_test.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+## Python Unit Tests
+TEST_RESULT_FILE='test_results.txt'
+PYTHON_TEST=generate_endpoint_test.py
+EXPECTED_NUM_TESTS=14
+set +e
+python $PYTHON_TEST >$CLIENT_LOG 2>&1
 if [ $? -ne 0 ]; then
     cat $CLIENT_LOG
     RET=1
@@ -567,6 +670,74 @@ set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
+### Test Restricted APIs ###
+### Repeated API not allowed
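+### (Flag format is --http-restricted-api=<api>[,<api>...]:<key>=<value>;
+### presumably requests to a restricted API must supply the matching header.)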
+
+MODELDIR="`pwd`/models"
+SERVER_ARGS="--model-repository=${MODELDIR}
+             --http-restricted-api=model-repository,health:k1=v1 \
+             --http-restricted-api=metadata,health:k2=v2"
+SERVER_LOG="./http_restricted_endpoint_test.log"
+CLIENT_LOG="./http_restricted_endpoint_test.log"
+run_server
+EXPECTED_MSG="api 'health' can not be specified in multiple config groups"
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** Expect fail to start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    RET=1
+elif [ `grep -c "${EXPECTED_MSG}" ${SERVER_LOG}` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected ${EXPECTED_MSG} to be found in log\n***"
+    cat $SERVER_LOG
+    RET=1
+fi
+
+### Test Unknown Restricted API ###
+### Unknown API not allowed
+
+MODELDIR="`pwd`/models"
+SERVER_ARGS="--model-repository=${MODELDIR}
+             --http-restricted-api=model-reposit,health:k1=v1 \
+             --http-restricted-api=metadata,health:k2=v2"
+run_server
+EXPECTED_MSG="unknown restricted api 'model-reposit'"
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** Expect fail to start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    RET=1
+elif [ `grep -c "${EXPECTED_MSG}" ${SERVER_LOG}` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected ${EXPECTED_MSG} to be found in log\n***"
+    cat $SERVER_LOG
+    RET=1
+fi
+
+### Test Restricted APIs ###
+### Restricted model-repository, metadata, and inference
+
+SERVER_ARGS="--model-repository=${MODELDIR} \
+             --http-restricted-api=model-repository:admin-key=admin-value \
+             --http-restricted-api=inference,metadata:infer-key=infer-value"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+set +e
+
+python $RESTRICTED_API_TEST RestrictedAPITest > $CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Python HTTP Restricted Protocol Test Failed\n***"
+    RET=1
+fi
+set -e
+kill $SERVER_PID
+wait $SERVER_PID
+
+###
+
 if [ $RET -eq 0 ]; then
     echo -e "\n***\n*** Test Passed\n***"
 else
diff --git a/qa/L0_http_fuzz/fuzztest.py b/qa/L0_http_fuzz/fuzztest.py
old mode 100644
new mode 100755
index b8a52a4b2f..8e84ffffc7
--- a/qa/L0_http_fuzz/fuzztest.py
+++ b/qa/L0_http_fuzz/fuzztest.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,30 +27,32 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
+import glob
+import os
+import sqlite3
 import unittest
+
 import test_util as tu
-import sqlite3
 from boofuzz import *
-import glob
-import os
 
 
 class FuzzTest(tu.TestResultCollector):
-
     def _run_fuzz(self, url, logger):
         session = Session(
             target=Target(connection=TCPSocketConnection("127.0.0.1", 8000)),
             fuzz_loggers=logger,
-            keep_web_open=False)
+            keep_web_open=False,
+        )
 
         s_initialize(name="Request" + url)
         with s_block("Request-Line"):
-            s_group("Method", [
-                "GET", "HEAD", "POST", "PUT", "DELETE", "CONNECT", "OPTIONS",
-                "TRACE"
-            ])
+            s_group(
+                "Method",
+                ["GET", "HEAD", "POST", "PUT", "DELETE", "CONNECT", "OPTIONS", "TRACE"],
+            )
             s_delim(" ", name="space-1")
             s_string(url, name="Request-URI")
             s_delim(" ", name="space-2")
@@ -61,28 +65,36 @@ def _run_fuzz(self, url, logger):
 
     def test_failures_from_db(self):
         url_list = [
-            "/v2", "/v2/models/simple", "/v2/models/simple/infer",
-            "/v2/models/simple/versions/v1", "/v2/models/simple/config",
-            "/v2/models/simple/stats", "/v2/models/simple/ready",
-            "/v2/health/ready", "/v2/health/live", "/v2/repository/index",
+            "/v2",
+            "/v2/models/simple",
+            "/v2/models/simple/infer",
+            "/v2/models/simple/versions/v1",
+            "/v2/models/simple/config",
+            "/v2/models/simple/stats",
+            "/v2/models/simple/ready",
+            "/v2/health/ready",
+            "/v2/health/live",
+            "/v2/repository/index",
             "/v2/repository/models/simple/unload",
             "/v2/repository/models/simple/load",
-            "/v2/systemsharedmemory/status", "/v2/systemsharedmemory/register",
+            "/v2/systemsharedmemory/status",
+            "/v2/systemsharedmemory/register",
             "/v2/systemsharedmemory/unregister",
             "/v2/systemsharedmemory/region/xx/status",
-            "/v2/cudasharedmemory/status", "/v2/cudasharedmemory/register",
+            "/v2/cudasharedmemory/status",
+            "/v2/cudasharedmemory/register",
             "/v2/cudasharedmemory/unregister",
-            "/v2/cudasharedmemory/region/xx/status"
+            "/v2/cudasharedmemory/region/xx/status",
         ]
 
-        csv_log = open('fuzz_results.csv', 'w')
+        csv_log = open("fuzz_results.csv", "w")
         logger = [FuzzLoggerCsv(file_handle=csv_log)]
 
         for url in url_list:
             self._run_fuzz(url, logger)
 
             # Get latest db file
-            files = glob.glob('boofuzz-results/*')
+            files = glob.glob("boofuzz-results/*")
             dbfile = max(files, key=os.path.getctime)
 
             conn = sqlite3.connect(dbfile)
@@ -90,10 +102,8 @@ def test_failures_from_db(self):
 
             # Get number of failures, should be 0
             self.assertEqual(
-                len([
-                    x for x in c.execute(
-                        "SELECT * FROM steps WHERE type=\"fail\"")
-                ]), 0)
+                len([x for x in c.execute('SELECT * FROM steps WHERE type="fail"')]), 0
+            )
 
 
 if __name__ == "__main__":
diff --git a/qa/L0_http_fuzz/test.sh b/qa/L0_http_fuzz/test.sh
old mode 100644
new mode 100755
index e56a6018e8..f721135698
--- a/qa/L0_http_fuzz/test.sh
+++ b/qa/L0_http_fuzz/test.sh
@@ -55,6 +55,20 @@ SERVER=/opt/tritonserver/bin/tritonserver
 SERVER_ARGS="--model-repository=$DATADIR"
 source ../common/util.sh
 
+# Remove this once the boofuzz and tornado packages are upgraded to work with Python 3.10.
+# This test exercises the server's ability to handle malformed input, not its
+# compatibility with Python 3.10, so Python 3.8 is fine to use here.
+function_install_python38() {
+    source ../L0_backend_python/common.sh
+    install_conda
+    create_conda_env "3.8" "python-3-8"
+
+    # Install test script dependencies
+    pip3 install --upgrade wheel setuptools boofuzz==0.3.0 numpy pillow attrdict future grpcio requests gsutil \
+                            awscli six grpcio-channelz prettytable virtualenv
+}
+function_install_python38
+
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
@@ -65,7 +79,7 @@ fi
 set +e
 
 # Test health
-python $FUZZTEST -v >> ${FUZZ_LOG} 2>&1
+python3 $FUZZTEST -v >> ${FUZZ_LOG} 2>&1
 if [ $? -ne 0 ]; then
     cat ${FUZZ_LOG}
     RET=1
diff --git a/qa/L0_https/test.sh b/qa/L0_https/test.sh
old mode 100644
new mode 100755
index 58473bf735..2c030332e5
--- a/qa/L0_https/test.sh
+++ b/qa/L0_https/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -42,6 +42,7 @@ export CUDA_VISIBLE_DEVICES=0
 
 RET=0
 
+SIMPLE_AIO_INFER_CLIENT_PY=../clients/simple_http_aio_infer_client.py
 SIMPLE_INFER_CLIENT_PY=../clients/simple_http_infer_client.py
 TEST_CLIENT=../clients/simple_http_infer_client
 
@@ -103,6 +104,11 @@ if [ $? -ne 0 ]; then
     cat ${CLIENT_LOG}.ssl_infer
     RET=1
 fi
+python $SIMPLE_AIO_INFER_CLIENT_PY -v -u localhost --ssl --key-file client.key --cert-file client.crt --ca-certs ca.crt >> ${CLIENT_LOG}.ssl_infer.aio 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.ssl_infer.aio
+    RET=1
+fi
 
 $TEST_CLIENT -v -u https://localhost:443 --key-file client.key --cert-file client.crt --ca-certs ca.crt >> ${CLIENT_LOG}.c++.ssl_infer 2>&1
 if [ $? -ne 0 ]; then
@@ -116,6 +122,11 @@ if [ $? -ne 0 ]; then
     cat ${CLIENT_LOG}.ssl_infer_insecure
     RET=1
 fi
+python $SIMPLE_AIO_INFER_CLIENT_PY -v -u localhost --ssl --insecure >> ${CLIENT_LOG}.ssl_infer_insecure.aio 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.ssl_infer_insecure.aio
+    RET=1
+fi
 
 $TEST_CLIENT -v -u https://localhost:443 --verify-host 0 --verify-peer 0 >> ${CLIENT_LOG}.c++.ssl_infer_insecure 2>&1
 if [ $? -ne 0 ]; then
@@ -124,7 +135,7 @@ if [ $? -ne 0 ]; then
 fi
 
 # Test failure cases for SSL
-# Try without SSL 
+# Try without SSL
 $SIMPLE_INFER_CLIENT_PY -v -u localhost >> ${CLIENT_LOG}.no_ssl_fail_infer 2>&1
 if [ $? -ne 0 ]; then
     cat ${CLIENT_LOG}.no_ssl_fail_infer
@@ -132,6 +143,13 @@ if [ $? -ne 0 ]; then
 else
     RET=1
 fi
+$SIMPLE_AIO_INFER_CLIENT_PY -v -u localhost >> ${CLIENT_LOG}.no_ssl_fail_infer.aio 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.no_ssl_fail_infer.aio
+    echo -e "\n***\n*** Expected test failure\n***"
+else
+    RET=1
+fi
 
 $TEST_CLIENT -v -u https://localhost:443 >> ${CLIENT_LOG}.c++.no_ssl_fail_infer 2>&1
 if [ $? -ne 0 ]; then
@@ -150,6 +168,13 @@ if [ $? -ne 0 ]; then
 else
     RET=1
 fi
+$SIMPLE_AIO_INFER_CLIENT_PY -v -u localhost --ssl --key-file client2.key --cert-file client.crt --ca-certs ca.crt >> ${CLIENT_LOG}.ssl_wrong_key.aio 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.ssl_wrong_key.aio
+    echo -e "\n***\n*** Expected test failure\n***"
+else
+    RET=1
+fi
 
 $TEST_CLIENT -v -u https://localhost:443 --key-file client2.key --cert-file client.crt --ca-certs ca.crt >> ${CLIENT_LOG}.c++.ssl_wrong_key 2>&1
 if [ $? -ne 0 ]; then
diff --git a/qa/L0_implicit_state/implicit_state.py b/qa/L0_implicit_state/implicit_state.py
old mode 100644
new mode 100755
index 72d8bb1a37..2cdf7ff2e0
--- a/qa/L0_implicit_state/implicit_state.py
+++ b/qa/L0_implicit_state/implicit_state.py
@@ -1,5 +1,5 @@
 #!/usr/bin/env python
-# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -27,77 +27,185 @@
 
 import sys
 
-sys.path.append('../common')
+sys.path.append("../common")
 
-import argparse
-import numpy as np
 import os
+import unittest
 from builtins import range
+
+import numpy as np
+import test_util as tu
 import tritonclient.http as tritonhttpclient
 from tritonclient.utils import InferenceServerException
-import unittest
-import test_util as tu
-BACKENDS = os.environ.get('BACKENDS', "onnx plan")
 
+BACKENDS = os.environ.get("BACKENDS", "onnx plan libtorch")
 
-class ImplicitStateTest(tu.TestResultCollector):
 
+class ImplicitStateTest(tu.TestResultCollector):
     def test_no_implicit_state(self):
         triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
         inputs = []
-        inputs.append(tritonhttpclient.InferInput('INPUT', [1], 'INT32'))
-        inputs.append(tritonhttpclient.InferInput('TEST_CASE', [1], 'INT32'))
+        inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+        inputs.append(tritonhttpclient.InferInput("TEST_CASE", [1], "INT32"))
         inputs[0].set_data_from_numpy(np.random.randint(5, size=[1], dtype=np.int32))
         inputs[1].set_data_from_numpy(np.asarray([0], dtype=np.int32))
 
         with self.assertRaises(InferenceServerException) as e:
-            triton_client.infer(model_name="no_implicit_state", inputs=inputs, sequence_id=1, sequence_start=True)
+            triton_client.infer(
+                model_name="no_implicit_state",
+                inputs=inputs,
+                sequence_id=1,
+                sequence_start=True,
+            )
 
-        self.assertEqual(str(e.exception), "unable to add state 'undefined_state'. State configuration is missing for model 'no_implicit_state'.")
+        err_str = str(e.exception).lower()
+        self.assertIn("unable to add state 'undefined_state'", err_str)
+        self.assertIn(
+            "state configuration is missing for model 'no_implicit_state'", err_str
+        )
 
     def test_wrong_implicit_state_name(self):
         triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
         inputs = []
-        inputs.append(tritonhttpclient.InferInput('INPUT', [1], 'INT32'))
-        inputs.append(tritonhttpclient.InferInput('TEST_CASE', [1], 'INT32'))
+        inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+        inputs.append(tritonhttpclient.InferInput("TEST_CASE", [1], "INT32"))
         inputs[0].set_data_from_numpy(np.random.randint(5, size=[1], dtype=np.int32))
         inputs[1].set_data_from_numpy(np.asarray([0], dtype=np.int32))
 
         with self.assertRaises(InferenceServerException) as e:
-            triton_client.infer(model_name="wrong_internal_state", inputs=inputs, sequence_id=2, sequence_start=True)
+            triton_client.infer(
+                model_name="wrong_internal_state",
+                inputs=inputs,
+                sequence_id=2,
+                sequence_start=True,
+            )
+
+        err_str = str(e.exception).lower()
+        self.assertIn("state 'undefined_state' is not a valid state name", err_str)
 
-        self.assertEqual(str(e.exception), "state 'undefined_state' is not a valid state name.")
+    def test_implicit_state_single_buffer(self):
+        triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
+        inputs = []
+        inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+        inputs.append(tritonhttpclient.InferInput("TEST_CASE", [1], "INT32"))
+        inputs[0].set_data_from_numpy(np.random.randint(5, size=[1], dtype=np.int32))
+        inputs[1].set_data_from_numpy(np.asarray([2], dtype=np.int32))
+
+        triton_client.infer(
+            model_name="single_state_buffer",
+            inputs=inputs,
+            sequence_id=2,
+            sequence_start=True,
+            sequence_end=False,
+        )
+
+        triton_client.infer(
+            model_name="single_state_buffer",
+            inputs=inputs,
+            sequence_id=2,
+            sequence_start=False,
+            sequence_end=True,
+        )
+
+    def test_implicit_state_growable_memory(self):
+        triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
+        inputs = []
+        inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+        inputs.append(tritonhttpclient.InferInput("TEST_CASE", [1], "INT32"))
+        inputs[0].set_data_from_numpy(np.random.randint(5, size=[1], dtype=np.int32))
+        inputs[1].set_data_from_numpy(np.asarray([3], dtype=np.int32))
+
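+        # Each step below grows the implicit state buffer; the returned
+        # OUTPUT_STATE is checked against the expected accumulated contents
+        # after every inference.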
+        output = triton_client.infer(
+            model_name="growable_memory",
+            inputs=inputs,
+            sequence_id=2,
+            sequence_start=True,
+            sequence_end=False,
+        )
+        output_state = output.as_numpy("OUTPUT_STATE")
+        expected_output_state = np.zeros(output_state.shape, dtype=np.int8)
+        np.testing.assert_equal(output_state, expected_output_state)
+
+        output = triton_client.infer(
+            model_name="growable_memory",
+            inputs=inputs,
+            sequence_id=2,
+            sequence_start=False,
+            sequence_end=False,
+        )
+        output_state = output.as_numpy("OUTPUT_STATE")
+        expected_output_state = np.concatenate(
+            [expected_output_state, np.ones(expected_output_state.shape, dtype=np.int8)]
+        )
+        np.testing.assert_equal(output_state, expected_output_state)
+
+        output = triton_client.infer(
+            model_name="growable_memory",
+            inputs=inputs,
+            sequence_id=2,
+            sequence_start=False,
+            sequence_end=False,
+        )
+        output_state = output.as_numpy("OUTPUT_STATE")
+        expected_output_state = np.concatenate(
+            [
+                expected_output_state,
+                np.full(
+                    (expected_output_state.shape[0] // 2,), dtype=np.int8, fill_value=2
+                ),
+            ]
+        )
+        np.testing.assert_equal(output_state, expected_output_state)
 
     def test_no_update(self):
-	    # Test implicit state without updating any state
+        # Test implicit state without updating any state
         triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
         inputs = []
-        inputs.append(tritonhttpclient.InferInput('INPUT', [1], 'INT32'))
-        inputs.append(tritonhttpclient.InferInput('TEST_CASE', [1], 'INT32'))
+        inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+        inputs.append(tritonhttpclient.InferInput("TEST_CASE", [1], "INT32"))
         inputs[0].set_data_from_numpy(np.asarray([1], dtype=np.int32))
         inputs[1].set_data_from_numpy(np.asarray([1], dtype=np.int32))
         correlation_id = 3
 
         # Make sure the state is never updated.
-        result_start = triton_client.infer(model_name="no_state_update", inputs=inputs, sequence_id=correlation_id, sequence_start=True)
-        self.assertEqual(result_start.as_numpy('OUTPUT')[0], 1)
+        result_start = triton_client.infer(
+            model_name="no_state_update",
+            inputs=inputs,
+            sequence_id=correlation_id,
+            sequence_start=True,
+        )
+        self.assertEqual(result_start.as_numpy("OUTPUT")[0], 1)
         for _ in range(10):
-            result = triton_client.infer(model_name="no_state_update", inputs=inputs, sequence_id=correlation_id)
-            self.assertEqual(result.as_numpy('OUTPUT')[0], 1)
+            result = triton_client.infer(
+                model_name="no_state_update", inputs=inputs, sequence_id=correlation_id
+            )
+            self.assertEqual(result.as_numpy("OUTPUT")[0], 1)
 
-        result_start = triton_client.infer(model_name="no_state_update", inputs=inputs, sequence_id=correlation_id, sequence_end=True)
-        self.assertEqual(result.as_numpy('OUTPUT')[0], 1)
+        _ = triton_client.infer(
+            model_name="no_state_update",
+            inputs=inputs,
+            sequence_id=correlation_id,
+            sequence_end=True,
+        )
+        self.assertEqual(result.as_numpy("OUTPUT")[0], 1)
 
     def test_request_output_not_allowed(self):
         triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
-        inputs = []
-        inputs.append(tritonhttpclient.InferInput('INPUT', [1], 'INT32'))
-        inputs[0].set_data_from_numpy(np.asarray([1], dtype=np.int32))
-
-        outputs = []
-        outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT_STATE'))
 
         for backend in BACKENDS.split(" "):
+            inputs = []
+            if backend.strip() == "libtorch":
+                inputs.append(tritonhttpclient.InferInput("INPUT__0", [1], "INT32"))
+            else:
+                inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+            inputs[0].set_data_from_numpy(np.asarray([1], dtype=np.int32))
+
+            outputs = []
+            if backend.strip() == "libtorch":
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT_STATE__1"))
+            else:
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT_STATE"))
+
             with self.assertRaises(InferenceServerException) as e:
                 triton_client.infer(
                     model_name=f"{backend}_nobatch_sequence_int32",
@@ -105,31 +213,52 @@ def test_request_output_not_allowed(self):
                     outputs=outputs,
                     sequence_id=1,
                     sequence_start=True,
-                    sequence_end=True)
-            self.assertTrue(str(e.exception).startswith("unexpected inference output 'OUTPUT_STATE' for model"))
+                    sequence_end=True,
+                )
+            if backend.strip() == "libtorch":
+                self.assertIn(
+                    "unexpected inference output 'OUTPUT_STATE__1' for model",
+                    str(e.exception),
+                )
+            else:
+                self.assertIn(
+                    "unexpected inference output 'OUTPUT_STATE' for model",
+                    str(e.exception),
+                )
 
     def test_request_output(self):
         triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
-        inputs = []
-        inputs.append(tritonhttpclient.InferInput('INPUT', [1], 'INT32'))
-        inputs[0].set_data_from_numpy(np.asarray([1], dtype=np.int32))
+        for backend in BACKENDS.split(" "):
+            inputs = []
+            if backend.strip() == "libtorch":
+                inputs.append(tritonhttpclient.InferInput("INPUT__0", [1], "INT32"))
+            else:
+                inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+            inputs[0].set_data_from_numpy(np.asarray([1], dtype=np.int32))
 
-        outputs = []
-        outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT_STATE'))
-        outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT'))
+            outputs = []
+            if backend.strip() == "libtorch":
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT_STATE__1"))
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT__0"))
+            else:
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT_STATE"))
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT"))
 
-        for backend in BACKENDS.split(" "):
             result = triton_client.infer(
-                    model_name=f"{backend}_nobatch_sequence_int32_output",
-                    inputs=inputs,
-                    outputs=outputs,
-                    sequence_id=1,
-                    sequence_start=True,
-                    sequence_end=True)
-            self.assertTrue(result.as_numpy('OUTPUT_STATE')[0], 1)
-            self.assertTrue(result.as_numpy('OUTPUT')[0], 1)
+                model_name=f"{backend}_nobatch_sequence_int32_output",
+                inputs=inputs,
+                outputs=outputs,
+                sequence_id=1,
+                sequence_start=True,
+                sequence_end=True,
+            )
+            if backend.strip() == "libtorch":
+                self.assertEqual(result.as_numpy("OUTPUT_STATE__1")[0], 1)
+                self.assertEqual(result.as_numpy("OUTPUT__0")[0], 1)
+            else:
+                self.assertEqual(result.as_numpy("OUTPUT_STATE")[0], 1)
+                self.assertEqual(result.as_numpy("OUTPUT")[0], 1)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
-
diff --git a/qa/L0_implicit_state/models/growable_memory/config.pbtxt b/qa/L0_implicit_state/models/growable_memory/config.pbtxt
new file mode 100644
index 0000000000..0a7920bdf1
--- /dev/null
+++ b/qa/L0_implicit_state/models/growable_memory/config.pbtxt
@@ -0,0 +1,103 @@
+# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+name: "growable_memory"
+backend: "implicit_state"
+max_batch_size: 0
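+# This model exercises a growable, GPU-backed implicit state buffer
+# (use_growable_memory below); the L0_implicit_state test.sh launches the
+# server with --cuda-virtual-address-size alongside it.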
+sequence_batching {
+  control_input [
+    {
+      name: "START"
+      control [
+        {
+          kind: CONTROL_SEQUENCE_START
+          fp32_false_true: [ 0, 1 ]
+        }
+      ]
+    },
+    {
+      name: "READY"
+      control [
+        {
+          kind: CONTROL_SEQUENCE_READY
+          fp32_false_true: [ 0, 1 ]
+        }
+      ]
+    },
+    {
+      name: "END"
+      control [
+        {
+          kind: CONTROL_SEQUENCE_END
+          fp32_false_true: [ 0, 1 ]
+        }
+      ]
+    }
+  ]
+  state [
+    {
+        input_name: "INPUT_STATE"
+        output_name: "OUTPUT_STATE"
+        data_type: TYPE_INT8
+        dims: [1024, 1024]
+        use_same_buffer_for_input_output: true
+        use_growable_memory: true
+    }
+  ]
+}
+
+input [
+  {
+    name: "INPUT"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  },
+  {
+    name: "TEST_CASE"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  }
+]
+
+output [
+  {
+    name: "OUTPUT"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  },
+  {
+    name: "OUTPUT_STATE"
+    data_type: TYPE_INT8
+    dims: [ 1 ]
+  }
+]
+
+instance_group [
+  {
+    count: 1
+    kind : KIND_GPU
+  }
+]
diff --git a/qa/L0_implicit_state/models/single_state_buffer/config.pbtxt b/qa/L0_implicit_state/models/single_state_buffer/config.pbtxt
new file mode 100644
index 0000000000..0f72d772a6
--- /dev/null
+++ b/qa/L0_implicit_state/models/single_state_buffer/config.pbtxt
@@ -0,0 +1,97 @@
+# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+name: "single_state_buffer"
+backend: "implicit_state"
+max_batch_size: 0
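+# The state below sets use_same_buffer_for_input_output, so the input and
+# output state share a single buffer across the sequence.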
+sequence_batching {
+  control_input [
+    {
+      name: "START"
+      control [
+        {
+          kind: CONTROL_SEQUENCE_START
+          fp32_false_true: [ 0, 1 ]
+        }
+      ]
+    },
+    {
+      name: "READY"
+      control [
+        {
+          kind: CONTROL_SEQUENCE_READY
+          fp32_false_true: [ 0, 1 ]
+        }
+      ]
+    },
+    {
+      name: "END"
+      control [
+        {
+          kind: CONTROL_SEQUENCE_END
+          fp32_false_true: [ 0, 1 ]
+        }
+      ]
+    }
+  ]
+  state [
+    {
+        input_name: "INPUT_STATE"
+        output_name: "OUTPUT_STATE"
+        data_type: TYPE_INT32
+        dims: 1
+        use_same_buffer_for_input_output: true
+    }
+  ]
+}
+
+input [
+  {
+    name: "INPUT"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  },
+  {
+    name: "TEST_CASE"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  }
+]
+
+output [
+  {
+    name: "OUTPUT"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  }
+]
+
+instance_group [
+  {
+    count: 1
+    kind : KIND_CPU
+  }
+]
diff --git a/qa/L0_implicit_state/test.sh b/qa/L0_implicit_state/test.sh
old mode 100644
new mode 100755
index 04436ec8d5..0722d29be1
--- a/qa/L0_implicit_state/test.sh
+++ b/qa/L0_implicit_state/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -41,14 +41,16 @@ DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"}
 TEST_RESULT_FILE='test_results.txt'
 
 export ENSEMBLES=0
-BACKENDS=${BACKENDS:="onnx plan"}
+BACKENDS=${BACKENDS:="libtorch onnx plan"}
 export BACKENDS
 export IMPLICIT_STATE=1
 INITIAL_STATE_ZERO=${INITIAL_STATE_ZERO:="0"}
 INITIAL_STATE_FILE=${INITIAL_STATE_FILE:="0"}
+SINGLE_STATE_BUFFER=${SINGLE_STATE_BUFFER:="0"}
 
 export INITIAL_STATE_ZERO
 export INITIAL_STATE_FILE
+export SINGLE_STATE_BUFFER
 
 MODELDIR=${MODELDIR:=`pwd`/models}
 TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"}
@@ -60,10 +62,14 @@ source ../common/util.sh
 cp ./libtriton_implicit_state.so models/no_implicit_state/
 cp ./libtriton_implicit_state.so models/no_state_update/
 cp ./libtriton_implicit_state.so models/wrong_internal_state/
+cp ./libtriton_implicit_state.so models/single_state_buffer/
+cp ./libtriton_implicit_state.so models/growable_memory/
 
 mkdir -p models/no_implicit_state/1/
 mkdir -p models/no_state_update/1/
 mkdir -p models/wrong_internal_state/1/
+mkdir -p models/single_state_buffer/1/
+mkdir -p models/growable_memory/1/
 
 for BACKEND in $BACKENDS; do
     dtype="int32"
@@ -78,15 +84,21 @@ for BACKEND in $BACKENDS; do
     rm -rf models/$model_name_allow_output
     cp -r $DATADIR/qa_sequence_implicit_model_repository/$model_name models/$model_name_allow_output
 
-    (cd models/$model_name_allow_output && \
-        sed -i "s/^name:.*/name: \"$model_name_allow_output\"/" config.pbtxt && \
-        echo -e "output [{ name: \"OUTPUT_STATE\" \n data_type: TYPE_INT32 \n dims: [ 1 ] }]" >> config.pbtxt)
+    if [ $BACKEND == "libtorch" ]; then
+        (cd models/$model_name_allow_output && \
+            sed -i "s/^name:.*/name: \"$model_name_allow_output\"/" config.pbtxt && \
+            echo -e "output [{ name: \"OUTPUT_STATE__1\" \n data_type: TYPE_INT32 \n dims: [ 1 ] }]" >> config.pbtxt)
+    else
+        (cd models/$model_name_allow_output && \
+            sed -i "s/^name:.*/name: \"$model_name_allow_output\"/" config.pbtxt && \
+            echo -e "output [{ name: \"OUTPUT_STATE\" \n data_type: TYPE_INT32 \n dims: [ 1 ] }]" >> config.pbtxt)
+    fi
 done
 
 CLIENT_LOG=`pwd`/client.log
-SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${MODELDIR}"
+SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${MODELDIR} --cuda-virtual-address-size=0:$((1024*1024*4))"
 IMPLICIT_STATE_CLIENT='implicit_state.py'
-EXPECTED_TEST_NUM=5
+EXPECTED_TEST_NUM=7
 rm -rf $CLIENT_LOG
 
 run_server
diff --git a/qa/L0_infer/infer_test.py b/qa/L0_infer/infer_test.py
old mode 100644
new mode 100755
index cb684c74db..3d5e116b4b
--- a/qa/L0_infer/infer_test.py
+++ b/qa/L0_infer/infer_test.py
@@ -1,4 +1,6 @@
-# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,64 +27,105 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
+import os
 import unittest
-import numpy as np
+
 import infer_util as iu
+import numpy as np
 import test_util as tu
-import os
-
 from tritonclient.utils import *
 
-TEST_SYSTEM_SHARED_MEMORY = bool(
-    int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0)))
-TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get('TEST_CUDA_SHARED_MEMORY',
-                                                  0)))
-CPU_ONLY = (os.environ.get('TRITON_SERVER_CPU_ONLY') is not None)
+TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0)))
+TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0)))
+CPU_ONLY = os.environ.get("TRITON_SERVER_CPU_ONLY") is not None
+TEST_VALGRIND = bool(int(os.environ.get("TEST_VALGRIND", 0)))
 
-USE_GRPC = (os.environ.get('USE_GRPC', 1) != "0")
-USE_HTTP = (os.environ.get('USE_HTTP', 1) != "0")
+USE_GRPC = os.environ.get("USE_GRPC", 1) != "0"
+USE_HTTP = os.environ.get("USE_HTTP", 1) != "0"
 assert USE_GRPC or USE_HTTP, "USE_GRPC or USE_HTTP must be non-zero"
 
 BACKENDS = os.environ.get(
-    'BACKENDS',
-    "graphdef savedmodel onnx libtorch plan python python_dlpack openvino")
-ENSEMBLES = bool(int(os.environ.get('ENSEMBLES', 1)))
+    "BACKENDS", "graphdef savedmodel onnx libtorch plan python python_dlpack openvino"
+)
+ENSEMBLES = bool(int(os.environ.get("ENSEMBLES", 1)))
+NOBATCH = bool(int(os.environ.get("NOBATCH", 1)))
+BATCH = bool(int(os.environ.get("BATCH", 1)))
 
 np_dtype_string = np.dtype(object)
 
+# 60 sec is the default value
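+# Valgrind runs are much slower, so allow a longer client network timeout.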
+NETWORK_TIMEOUT = 300.0 if TEST_VALGRIND else 60.0
 
-class InferTest(tu.TestResultCollector):
 
-    def _full_exact(self, input_dtype, output0_dtype, output1_dtype,
-                    output0_raw, output1_raw, swap):
-
-        def _infer_exact_helper(tester,
-                                pf,
-                                tensor_shape,
-                                batch_size,
-                                input_dtype,
-                                output0_dtype,
-                                output1_dtype,
-                                output0_raw=True,
-                                output1_raw=True,
-                                model_version=None,
-                                swap=False,
-                                outputs=("OUTPUT0", "OUTPUT1"),
-                                use_http=USE_HTTP,
-                                use_grpc=USE_GRPC,
-                                use_http_json_tensors=True,
-                                skip_request_id_check=True,
-                                use_streaming=True,
-                                correlation_id=0):
+class InferTest(tu.TestResultCollector):
+    def _full_exact(
+        self,
+        input_dtype,
+        output0_dtype,
+        output1_dtype,
+        output0_raw,
+        output1_raw,
+        swap,
+        network_timeout=NETWORK_TIMEOUT,
+    ):
+        def _infer_exact_helper(
+            tester,
+            pf,
+            tensor_shape,
+            batch_size,
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            output0_raw=True,
+            output1_raw=True,
+            model_version=None,
+            swap=False,
+            outputs=("OUTPUT0", "OUTPUT1"),
+            use_http=USE_HTTP,
+            use_grpc=USE_GRPC,
+            use_http_json_tensors=True,
+            skip_request_id_check=True,
+            use_streaming=True,
+            correlation_id=0,
+            network_timeout=NETWORK_TIMEOUT,
+        ):
             for bs in (1, batch_size):
                 # model that does not support batching
-                if bs == 1:
+                if NOBATCH:
+                    if bs == 1:
+                        iu.infer_exact(
+                            tester,
+                            pf + "_nobatch",
+                            tensor_shape,
+                            bs,
+                            input_dtype,
+                            output0_dtype,
+                            output1_dtype,
+                            output0_raw=output0_raw,
+                            output1_raw=output1_raw,
+                            model_version=model_version,
+                            swap=swap,
+                            outputs=outputs,
+                            use_http=use_http,
+                            use_grpc=use_grpc,
+                            use_http_json_tensors=use_http_json_tensors,
+                            skip_request_id_check=skip_request_id_check,
+                            use_streaming=use_streaming,
+                            correlation_id=correlation_id,
+                            use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                            network_timeout=network_timeout,
+                        )
+
+                if BATCH:
+                    # model that supports batching.
                     iu.infer_exact(
                         tester,
-                        pf + "_nobatch",
-                        tensor_shape,
+                        pf,
+                        (bs,) + tensor_shape,
                         bs,
                         input_dtype,
                         output0_dtype,
@@ -99,29 +142,9 @@ def _infer_exact_helper(tester,
                         use_streaming=use_streaming,
                         correlation_id=correlation_id,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
-
-                # model that supports batching.
-                iu.infer_exact(
-                    tester,
-                    pf, (bs,) + tensor_shape,
-                    bs,
-                    input_dtype,
-                    output0_dtype,
-                    output1_dtype,
-                    output0_raw=output0_raw,
-                    output1_raw=output1_raw,
-                    model_version=model_version,
-                    swap=swap,
-                    outputs=outputs,
-                    use_http=use_http,
-                    use_grpc=use_grpc,
-                    use_http_json_tensors=use_http_json_tensors,
-                    skip_request_id_check=skip_request_id_check,
-                    use_streaming=use_streaming,
-                    correlation_id=correlation_id,
-                    use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                        network_timeout=network_timeout,
+                    )
 
         input_size = 16
 
@@ -129,88 +152,131 @@ def _infer_exact_helper(tester,
         ensemble_prefix = [""]
         if ENSEMBLES:
             for prefix in all_ensemble_prefix:
-                if tu.validate_for_ensemble_model(prefix, input_dtype,
-                                                  output0_dtype, output1_dtype,
-                                                  (input_size,), (input_size,),
-                                                  (input_size,)):
+                if tu.validate_for_ensemble_model(
+                    prefix,
+                    input_dtype,
+                    output0_dtype,
+                    output1_dtype,
+                    (input_size,),
+                    (input_size,),
+                    (input_size,),
+                ):
                     ensemble_prefix.append(prefix)
 
-        if tu.validate_for_tf_model(input_dtype, output0_dtype, output1_dtype,
-                                    (input_size,), (input_size,),
-                                    (input_size,)):
+        if tu.validate_for_tf_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            (input_size,),
+            (input_size,),
+            (input_size,),
+        ):
             for prefix in ensemble_prefix:
                 for pf in ["graphdef", "savedmodel"]:
                     if pf in BACKENDS:
-                        _infer_exact_helper(self,
-                                            prefix + pf, (input_size,),
-                                            8,
-                                            input_dtype,
-                                            output0_dtype,
-                                            output1_dtype,
-                                            output0_raw=output0_raw,
-                                            output1_raw=output1_raw,
-                                            swap=swap)
+                        _infer_exact_helper(
+                            self,
+                            prefix + pf,
+                            (input_size,),
+                            8,
+                            input_dtype,
+                            output0_dtype,
+                            output1_dtype,
+                            output0_raw=output0_raw,
+                            output1_raw=output1_raw,
+                            swap=swap,
+                            network_timeout=network_timeout,
+                        )
 
         if not CPU_ONLY and tu.validate_for_trt_model(
-                input_dtype, output0_dtype, output1_dtype, (input_size, 1, 1),
-            (input_size, 1, 1), (input_size, 1, 1)):
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            (input_size, 1, 1),
+            (input_size, 1, 1),
+            (input_size, 1, 1),
+        ):
             for prefix in ensemble_prefix:
-                if 'plan' in BACKENDS:
+                if "plan" in BACKENDS:
                     if input_dtype == np.int8:
-                        _infer_exact_helper(self,
-                                            prefix + 'plan', (input_size, 1, 1),
-                                            8,
-                                            input_dtype,
-                                            output0_dtype,
-                                            output1_dtype,
-                                            output0_raw=output0_raw,
-                                            output1_raw=output1_raw,
-                                            swap=swap)
+                        _infer_exact_helper(
+                            self,
+                            prefix + "plan",
+                            (input_size, 1, 1),
+                            8,
+                            input_dtype,
+                            output0_dtype,
+                            output1_dtype,
+                            output0_raw=output0_raw,
+                            output1_raw=output1_raw,
+                            swap=swap,
+                        )
                     else:
-                        _infer_exact_helper(self,
-                                            prefix + 'plan', (input_size,),
-                                            8,
-                                            input_dtype,
-                                            output0_dtype,
-                                            output1_dtype,
-                                            output0_raw=output0_raw,
-                                            output1_raw=output1_raw,
-                                            swap=swap)
-
-        if tu.validate_for_onnx_model(input_dtype, output0_dtype, output1_dtype,
-                                      (input_size,), (input_size,),
-                                      (input_size,)):
+                        _infer_exact_helper(
+                            self,
+                            prefix + "plan",
+                            (input_size,),
+                            8,
+                            input_dtype,
+                            output0_dtype,
+                            output1_dtype,
+                            output0_raw=output0_raw,
+                            output1_raw=output1_raw,
+                            swap=swap,
+                        )
+
+        if tu.validate_for_onnx_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            (input_size,),
+            (input_size,),
+            (input_size,),
+        ):
             for prefix in ensemble_prefix:
-                if 'onnx' in BACKENDS:
-                    _infer_exact_helper(self,
-                                        prefix + 'onnx', (input_size,),
-                                        8,
-                                        input_dtype,
-                                        output0_dtype,
-                                        output1_dtype,
-                                        output0_raw=output0_raw,
-                                        output1_raw=output1_raw,
-                                        swap=swap)
-
-        if tu.validate_for_libtorch_model(input_dtype, output0_dtype,
-                                          output1_dtype, (input_size,),
-                                          (input_size,), (input_size,)):
+                if "onnx" in BACKENDS:
+                    _infer_exact_helper(
+                        self,
+                        prefix + "onnx",
+                        (input_size,),
+                        8,
+                        input_dtype,
+                        output0_dtype,
+                        output1_dtype,
+                        output0_raw=output0_raw,
+                        output1_raw=output1_raw,
+                        swap=swap,
+                    )
+
+        if tu.validate_for_libtorch_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            (input_size,),
+            (input_size,),
+            (input_size,),
+        ):
             # Due to PyTorch bug
             # https://github.com/pytorch/pytorch/issues/66930 we can't
             # run this test with int8 input and int32 outputs.
-            if ((input_dtype == np.int8) and (output0_dtype == np.int32) and
-                (output1_dtype == np.int32)):
-                print('skipping pytorch test for int8_int32_int32')
+            if (
+                (input_dtype == np.int8)
+                and (output0_dtype == np.int32)
+                and (output1_dtype == np.int32)
+            ):
+                print("skipping pytorch test for int8_int32_int32")
             else:
                 for prefix in ensemble_prefix:
-                    if 'libtorch' in BACKENDS:
+                    if "libtorch" in BACKENDS:
                         # Skip batching for PyTorch String I/O
-                        if ((input_dtype == np_dtype_string) or
-                            (output0_dtype == np_dtype_string) or
-                            (output1_dtype == np_dtype_string)):
+                        if (
+                            (input_dtype == np_dtype_string)
+                            or (output0_dtype == np_dtype_string)
+                            or (output1_dtype == np_dtype_string)
+                        ):
                             iu.infer_exact(
                                 self,
-                                prefix + 'libtorch_nobatch',
+                                prefix + "libtorch_nobatch",
                                 (input_size,),
                                 1,  # batch_size
                                 input_dtype,
@@ -221,304 +287,382 @@ def _infer_exact_helper(tester,
                                 swap=swap,
                                 use_http=USE_HTTP,
                                 use_grpc=USE_GRPC,
-                                use_system_shared_memory=
-                                TEST_SYSTEM_SHARED_MEMORY,
-                                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                            )
                         else:
-                            _infer_exact_helper(self,
-                                                prefix + 'libtorch',
-                                                (input_size,),
-                                                8,
-                                                input_dtype,
-                                                output0_dtype,
-                                                output1_dtype,
-                                                output0_raw=output0_raw,
-                                                output1_raw=output1_raw,
-                                                swap=swap)
+                            _infer_exact_helper(
+                                self,
+                                prefix + "libtorch",
+                                (input_size,),
+                                8,
+                                input_dtype,
+                                output0_dtype,
+                                output1_dtype,
+                                output0_raw=output0_raw,
+                                output1_raw=output1_raw,
+                                swap=swap,
+                            )
 
         for prefix in ensemble_prefix:
             if prefix != "":
                 continue
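+            # Skip the Python backend models below when any I/O datatype is uint8.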
+            if (
+                input_dtype == np.uint8
+                or output0_dtype == np.uint8
+                or output1_dtype == np.uint8
+            ):
+                continue
+
+            if "python_dlpack" in BACKENDS:
+                _infer_exact_helper(
+                    self,
+                    prefix + "python_dlpack",
+                    (input_size,),
+                    8,
+                    input_dtype,
+                    output0_dtype,
+                    output1_dtype,
+                    output0_raw=output0_raw,
+                    output1_raw=output1_raw,
+                    swap=swap,
+                )
+            elif "python" in BACKENDS:
+                _infer_exact_helper(
+                    self,
+                    prefix + "python",
+                    (input_size,),
+                    8,
+                    input_dtype,
+                    output0_dtype,
+                    output1_dtype,
+                    output0_raw=output0_raw,
+                    output1_raw=output1_raw,
+                    swap=swap,
+                )
 
-            if 'python_dlpack' in BACKENDS:
-                _infer_exact_helper(self,
-                                    prefix + 'python_dlpack', (input_size,),
-                                    8,
-                                    input_dtype,
-                                    output0_dtype,
-                                    output1_dtype,
-                                    output0_raw=output0_raw,
-                                    output1_raw=output1_raw,
-                                    swap=swap)
-            elif 'python' in BACKENDS:
-                _infer_exact_helper(self,
-                                    prefix + 'python', (input_size,),
-                                    8,
-                                    input_dtype,
-                                    output0_dtype,
-                                    output1_dtype,
-                                    output0_raw=output0_raw,
-                                    output1_raw=output1_raw,
-                                    swap=swap)
+    def test_raw_uuu(self):
+        self._full_exact(
+            np.uint8, np.uint8, np.uint8, output0_raw=True, output1_raw=True, swap=True
+        )
 
     def test_raw_bbb(self):
-        self._full_exact(np.int8,
-                         np.int8,
-                         np.int8,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=True)
+        self._full_exact(
+            np.int8, np.int8, np.int8, output0_raw=True, output1_raw=True, swap=True
+        )
 
     def test_raw_sss(self):
-        self._full_exact(np.int16,
-                         np.int16,
-                         np.int16,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=True)
+        self._full_exact(
+            np.int16, np.int16, np.int16, output0_raw=True, output1_raw=True, swap=True
+        )
 
     def test_raw_iii(self):
-        self._full_exact(np.int32,
-                         np.int32,
-                         np.int32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=True)
+        self._full_exact(
+            np.int32, np.int32, np.int32, output0_raw=True, output1_raw=True, swap=True
+        )
 
     def test_raw_lll(self):
-        self._full_exact(np.int64,
-                         np.int64,
-                         np.int64,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int64, np.int64, np.int64, output0_raw=True, output1_raw=True, swap=False
+        )
 
     def test_raw_hhh(self):
-        self._full_exact(np.float16,
-                         np.float16,
-                         np.float16,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.float16,
+            np.float16,
+            np.float16,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_fff(self):
-        self._full_exact(np.float32,
-                         np.float32,
-                         np.float32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=True)
+        self._full_exact(
+            np.float32,
+            np.float32,
+            np.float32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=True,
+        )
 
     def test_raw_hff(self):
-        self._full_exact(np.float16,
-                         np.float32,
-                         np.float32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.float16,
+            np.float32,
+            np.float32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_bii(self):
-        self._full_exact(np.int8,
-                         np.int32,
-                         np.int32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int8, np.int32, np.int32, output0_raw=True, output1_raw=True, swap=False
+        )
 
     def test_raw_ibb(self):
-        self._full_exact(np.int32,
-                         np.int8,
-                         np.int8,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32, np.int8, np.int8, output0_raw=True, output1_raw=True, swap=False
+        )
 
     def test_raw_ibs(self):
-        self._full_exact(np.int32,
-                         np.int8,
-                         np.int16,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32, np.int8, np.int16, output0_raw=True, output1_raw=True, swap=False
+        )
+
+    def test_raw_fuu(self):
+        self._full_exact(
+            np.float32,
+            np.uint8,
+            np.uint8,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
+
+    def test_raw_uff(self):
+        self._full_exact(
+            np.uint8,
+            np.float32,
+            np.float32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
+
+    def test_raw_fuh(self):
+        self._full_exact(
+            np.float32,
+            np.uint8,
+            np.float16,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_iff(self):
-        self._full_exact(np.int32,
-                         np.float32,
-                         np.float32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32,
+            np.float32,
+            np.float32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_fii(self):
-        self._full_exact(np.float32,
-                         np.int32,
-                         np.int32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.float32,
+            np.int32,
+            np.int32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_ihs(self):
-        self._full_exact(np.int32,
-                         np.float16,
-                         np.int16,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32,
+            np.float16,
+            np.int16,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_ooo(self):
-        self._full_exact(np_dtype_string,
-                         np_dtype_string,
-                         np_dtype_string,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np_dtype_string,
+            np_dtype_string,
+            np_dtype_string,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_oii(self):
-        self._full_exact(np_dtype_string,
-                         np.int32,
-                         np.int32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np_dtype_string,
+            np.int32,
+            np.int32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_oio(self):
-        self._full_exact(np_dtype_string,
-                         np.int32,
-                         np_dtype_string,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np_dtype_string,
+            np.int32,
+            np_dtype_string,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_ooi(self):
-        self._full_exact(np_dtype_string,
-                         np_dtype_string,
-                         np.int32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np_dtype_string,
+            np_dtype_string,
+            np.int32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_ioo(self):
-        self._full_exact(np.int32,
-                         np_dtype_string,
-                         np_dtype_string,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32,
+            np_dtype_string,
+            np_dtype_string,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_iio(self):
-        self._full_exact(np.int32,
-                         np.int32,
-                         np_dtype_string,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32,
+            np.int32,
+            np_dtype_string,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_ioi(self):
-        self._full_exact(np.int32,
-                         np_dtype_string,
-                         np.int32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32,
+            np_dtype_string,
+            np.int32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     # shared memory does not support class output
     if not (TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY):
 
         def test_class_bbb(self):
-            self._full_exact(np.int8,
-                             np.int8,
-                             np.int8,
-                             output0_raw=False,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.int8,
+                np.int8,
+                np.int8,
+                output0_raw=False,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_class_sss(self):
-            self._full_exact(np.int16,
-                             np.int16,
-                             np.int16,
-                             output0_raw=False,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.int16,
+                np.int16,
+                np.int16,
+                output0_raw=False,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_class_iii(self):
-            self._full_exact(np.int32,
-                             np.int32,
-                             np.int32,
-                             output0_raw=False,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.int32,
+                np.int32,
+                np.int32,
+                output0_raw=False,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_class_lll(self):
-            self._full_exact(np.int64,
-                             np.int64,
-                             np.int64,
-                             output0_raw=False,
-                             output1_raw=False,
-                             swap=False)
+            self._full_exact(
+                np.int64,
+                np.int64,
+                np.int64,
+                output0_raw=False,
+                output1_raw=False,
+                swap=False,
+            )
 
         def test_class_fff(self):
-            self._full_exact(np.float32,
-                             np.float32,
-                             np.float32,
-                             output0_raw=False,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.float32,
+                np.float32,
+                np.float32,
+                output0_raw=False,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_class_iff(self):
-            self._full_exact(np.int32,
-                             np.float32,
-                             np.float32,
-                             output0_raw=False,
-                             output1_raw=False,
-                             swap=False)
+            self._full_exact(
+                np.int32,
+                np.float32,
+                np.float32,
+                output0_raw=False,
+                output1_raw=False,
+                swap=False,
+            )
 
         def test_mix_bbb(self):
-            self._full_exact(np.int8,
-                             np.int8,
-                             np.int8,
-                             output0_raw=True,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.int8,
+                np.int8,
+                np.int8,
+                output0_raw=True,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_mix_sss(self):
-            self._full_exact(np.int16,
-                             np.int16,
-                             np.int16,
-                             output0_raw=False,
-                             output1_raw=True,
-                             swap=True)
+            self._full_exact(
+                np.int16,
+                np.int16,
+                np.int16,
+                output0_raw=False,
+                output1_raw=True,
+                swap=True,
+            )
 
         def test_mix_iii(self):
-            self._full_exact(np.int32,
-                             np.int32,
-                             np.int32,
-                             output0_raw=True,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.int32,
+                np.int32,
+                np.int32,
+                output0_raw=True,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_mix_lll(self):
-            self._full_exact(np.int64,
-                             np.int64,
-                             np.int64,
-                             output0_raw=False,
-                             output1_raw=True,
-                             swap=False)
+            self._full_exact(
+                np.int64,
+                np.int64,
+                np.int64,
+                output0_raw=False,
+                output1_raw=True,
+                swap=False,
+            )
 
         def test_mix_fff(self):
-            self._full_exact(np.float32,
-                             np.float32,
-                             np.float32,
-                             output0_raw=True,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.float32,
+                np.float32,
+                np.float32,
+                output0_raw=True,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_mix_iff(self):
-            self._full_exact(np.int32,
-                             np.float32,
-                             np.float32,
-                             output0_raw=False,
-                             output1_raw=True,
-                             swap=False)
+            self._full_exact(
+                np.int32,
+                np.float32,
+                np.float32,
+                output0_raw=False,
+                output1_raw=True,
+                swap=False,
+            )
 
     def test_raw_version_latest_1(self):
         input_size = 16
@@ -526,7 +670,7 @@ def test_raw_version_latest_1(self):
 
         # There are 3 versions of graphdef_int8_int8_int8 but
         # only version 3 should be available
-        for platform in ('graphdef', 'savedmodel'):
+        for platform in ("graphdef", "savedmodel"):
             if platform not in BACKENDS:
                 continue
             try:
@@ -543,10 +687,10 @@ def test_raw_version_latest_1(self):
                     use_http=USE_HTTP,
                     use_grpc=USE_GRPC,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             except InferenceServerException as ex:
-                self.assertTrue(
-                    ex.message().startswith("Request for unknown model"))
+                self.assertTrue(ex.message().startswith("Request for unknown model"))
 
             try:
                 iu.infer_exact(
@@ -562,24 +706,26 @@ def test_raw_version_latest_1(self):
                     use_http=USE_HTTP,
                     use_grpc=USE_GRPC,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             except InferenceServerException as ex:
-                self.assertTrue(
-                    ex.message().startswith("Request for unknown model"))
-
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.int8,
-                           np.int8,
-                           np.int8,
-                           model_version=3,
-                           swap=True,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                self.assertTrue(ex.message().startswith("Request for unknown model"))
+
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.int8,
+                np.int8,
+                np.int8,
+                model_version=3,
+                swap=True,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
     def test_raw_version_latest_2(self):
         input_size = 16
@@ -587,7 +733,7 @@ def test_raw_version_latest_2(self):
 
         # There are 3 versions of graphdef_int16_int16_int16 but only
         # versions 2 and 3 should be available
-        for platform in ('graphdef', 'savedmodel'):
+        for platform in ("graphdef", "savedmodel"):
             if platform not in BACKENDS:
                 continue
             try:
@@ -604,37 +750,41 @@ def test_raw_version_latest_2(self):
                     use_http=USE_HTTP,
                     use_grpc=USE_GRPC,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             except InferenceServerException as ex:
-                self.assertTrue(
-                    ex.message().startswith("Request for unknown model"))
-
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.int16,
-                           np.int16,
-                           np.int16,
-                           model_version=2,
-                           swap=True,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.int16,
-                           np.int16,
-                           np.int16,
-                           model_version=3,
-                           swap=True,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                self.assertTrue(ex.message().startswith("Request for unknown model"))
+
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.int16,
+                np.int16,
+                np.int16,
+                model_version=2,
+                swap=True,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.int16,
+                np.int16,
+                np.int16,
+                model_version=3,
+                swap=True,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
     def test_raw_version_all(self):
         input_size = 16
@@ -642,48 +792,54 @@ def test_raw_version_all(self):
 
         # There are 3 versions of *_int32_int32_int32 and all should
         # be available.
-        for platform in ('graphdef', 'savedmodel'):
+        for platform in ("graphdef", "savedmodel"):
             if platform not in BACKENDS:
                 continue
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.int32,
-                           np.int32,
-                           np.int32,
-                           model_version=1,
-                           swap=False,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.int32,
-                           np.int32,
-                           np.int32,
-                           model_version=2,
-                           swap=True,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.int32,
-                           np.int32,
-                           np.int32,
-                           model_version=3,
-                           swap=True,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.int32,
+                np.int32,
+                np.int32,
+                model_version=1,
+                swap=False,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.int32,
+                np.int32,
+                np.int32,
+                model_version=2,
+                swap=True,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.int32,
+                np.int32,
+                np.int32,
+                model_version=3,
+                swap=True,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
     def test_raw_version_specific_1(self):
         input_size = 16
@@ -691,22 +847,24 @@ def test_raw_version_specific_1(self):
 
         # There are 3 versions of *_float16_float16_float16 but only
         # version 1 should be available.
-        for platform in ('graphdef', 'savedmodel'):
+        for platform in ("graphdef", "savedmodel"):
             if platform not in BACKENDS:
                 continue
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.float16,
-                           np.float16,
-                           np.float16,
-                           model_version=1,
-                           swap=False,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.float16,
+                np.float16,
+                np.float16,
+                model_version=1,
+                swap=False,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
             try:
                 iu.infer_exact(
@@ -722,10 +880,10 @@ def test_raw_version_specific_1(self):
                     use_http=USE_HTTP,
                     use_grpc=USE_GRPC,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             except InferenceServerException as ex:
-                self.assertTrue(
-                    ex.message().startswith("Request for unknown model"))
+                self.assertTrue(ex.message().startswith("Request for unknown model"))
 
             try:
                 iu.infer_exact(
@@ -741,35 +899,37 @@ def test_raw_version_specific_1(self):
                     use_http=USE_HTTP,
                     use_grpc=USE_GRPC,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             except InferenceServerException as ex:
-                self.assertTrue(
-                    ex.message().startswith("Request for unknown model"))
+                self.assertTrue(ex.message().startswith("Request for unknown model"))
 
     def test_raw_version_specific_1_3(self):
         input_size = 16
 
         # There are 3 versions of *_float32_float32_float32 but only
         # versions 1 and 3 should be available.
-        for platform in ('graphdef', 'savedmodel', 'plan'):
-            if platform == 'plan' and CPU_ONLY:
+        for platform in ("graphdef", "savedmodel", "plan"):
+            if platform == "plan" and CPU_ONLY:
                 continue
             if platform not in BACKENDS:
                 continue
             tensor_shape = (1, input_size)
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           model_version=1,
-                           swap=False,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                model_version=1,
+                swap=False,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
             try:
                 iu.infer_exact(
@@ -785,27 +945,29 @@ def test_raw_version_specific_1_3(self):
                     use_http=USE_HTTP,
                     use_grpc=USE_GRPC,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             except InferenceServerException as ex:
-                self.assertTrue(
-                    ex.message().startswith("Request for unknown model"))
-
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           model_version=3,
-                           swap=True,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                self.assertTrue(ex.message().startswith("Request for unknown model"))
+
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                model_version=3,
+                swap=True,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
     if ENSEMBLES:
-        if all(x in BACKENDS for x in ['graphdef', 'savedmodel']):
+        if all(x in BACKENDS for x in ["graphdef", "savedmodel"]):
 
             def test_ensemble_mix_platform(self):
                 # Skip on CPU only machine as TensorRT model is used in this ensemble
@@ -814,7 +976,8 @@ def test_ensemble_mix_platform(self):
                 for bs in (1, 8):
                     iu.infer_exact(
                         self,
-                        "mix_platform", (bs, 16),
+                        "mix_platform",
+                        (bs, 16),
                         bs,
                         np.float32,
                         np.float32,
@@ -822,7 +985,8 @@ def test_ensemble_mix_platform(self):
                         use_http=USE_HTTP,
                         use_grpc=USE_GRPC,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
         if "graphdef" in BACKENDS:
 
@@ -830,7 +994,8 @@ def test_ensemble_mix_type(self):
                 for bs in (1, 8):
                     iu.infer_exact(
                         self,
-                        "mix_type", (bs, 16),
+                        "mix_type",
+                        (bs, 16),
                         bs,
                         np.int32,
                         np.float32,
@@ -838,15 +1003,17 @@ def test_ensemble_mix_type(self):
                         use_http=USE_HTTP,
                         use_grpc=USE_GRPC,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
-        if all(x in BACKENDS for x in ['graphdef', 'savedmodel']):
+        if all(x in BACKENDS for x in ["graphdef", "savedmodel"]):
 
             def test_ensemble_mix_ensemble(self):
                 for bs in (1, 8):
                     iu.infer_exact(
                         self,
-                        "mix_ensemble", (bs, 16),
+                        "mix_ensemble",
+                        (bs, 16),
                         bs,
                         np.int32,
                         np.float32,
@@ -854,11 +1021,15 @@ def test_ensemble_mix_ensemble(self):
                         use_http=USE_HTTP,
                         use_grpc=USE_GRPC,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
-        if all(x in BACKENDS for x in [
-                'graphdef',
-        ]):
+        if "graphdef" in BACKENDS:
 
             def test_ensemble_mix_batch_nobatch(self):
                 base_names = ["batch_to_nobatch", "nobatch_to_batch"]
@@ -866,7 +1037,8 @@ def test_ensemble_mix_batch_nobatch(self):
                     for bs in (1, 8):
                         iu.infer_exact(
                             self,
-                            name, (bs, 16),
+                            name,
+                            (bs, 16),
                             bs,
                             np.float32,
                             np.float32,
@@ -874,10 +1046,12 @@ def test_ensemble_mix_batch_nobatch(self):
                             use_http=USE_HTTP,
                             use_grpc=USE_GRPC,
                             use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                        )
                     iu.infer_exact(
                         self,
-                        name + "_nobatch", (8, 16),
+                        name + "_nobatch",
+                        (8, 16),
                         1,
                         np.float32,
                         np.float32,
@@ -885,13 +1059,15 @@ def test_ensemble_mix_batch_nobatch(self):
                         use_http=USE_HTTP,
                         use_grpc=USE_GRPC,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
                 # batch -> nobatch -> batch
                 for bs in (1, 8):
                     iu.infer_exact(
                         self,
-                        "mix_nobatch_batch", (bs, 16),
+                        "mix_nobatch_batch",
+                        (bs, 16),
                         bs,
                         np.float32,
                         np.float32,
@@ -899,17 +1075,19 @@ def test_ensemble_mix_batch_nobatch(self):
                         use_http=USE_HTTP,
                         use_grpc=USE_GRPC,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
         if not (TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY):
 
             def test_ensemble_label_lookup(self):
-                if all(x in BACKENDS for x in ['graphdef', 'savedmodel']):
+                if all(x in BACKENDS for x in ["graphdef", "savedmodel"]):
                     # Ensemble needs to look up label from the actual model
                     for bs in (1, 8):
                         iu.infer_exact(
                             self,
-                            "mix_platform", (bs, 16),
+                            "mix_platform",
+                            (bs, 16),
                             bs,
                             np.float32,
                             np.float32,
@@ -919,14 +1097,16 @@ def test_ensemble_label_lookup(self):
                             use_http=USE_HTTP,
                             use_grpc=USE_GRPC,
                             use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                        )
 
-                if all(x in BACKENDS for x in ['graphdef', 'savedmodel']):
+                if all(x in BACKENDS for x in ["graphdef", "savedmodel"]):
                     # Label from the actual model will be passed along the nested ensemble
                     for bs in (1, 8):
                         iu.infer_exact(
                             self,
-                            "mix_ensemble", (bs, 16),
+                            "mix_ensemble",
+                            (bs, 16),
                             bs,
                             np.int32,
                             np.float32,
@@ -936,14 +1116,16 @@ def test_ensemble_label_lookup(self):
                             use_http=USE_HTTP,
                             use_grpc=USE_GRPC,
                             use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                        )
 
                 if "graphdef" in BACKENDS:
                     # If label file is provided, it will use the provided label file directly
                     try:
                         iu.infer_exact(
                             self,
-                            "wrong_label", (1, 16),
+                            "wrong_label",
+                            (1, 16),
                             1,
                             np.int32,
                             np.float32,
@@ -953,7 +1135,8 @@ def test_ensemble_label_lookup(self):
                             use_http=USE_HTTP,
                             use_grpc=USE_GRPC,
                             use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                        )
                     except AssertionError:
                         # Sanity check that infer_exact failed since this ensemble is provided
                         # with unexpected labels
@@ -963,7 +1146,8 @@ def test_ensemble_label_lookup(self):
                     for bs in (1, 8):
                         iu.infer_exact(
                             self,
-                            "label_override", (bs, 16),
+                            "label_override",
+                            (bs, 16),
                             bs,
                             np.int32,
                             np.float32,
@@ -973,8 +1157,9 @@ def test_ensemble_label_lookup(self):
                             use_http=USE_HTTP,
                             use_grpc=USE_GRPC,
                             use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                        )
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
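
As a usage note (not part of the patch itself): the rewritten InferTest above is gated by the NOBATCH and BATCH toggles that test.sh exports below, so the no-batch and batched model variants can be exercised independently. A minimal sketch of running a single dtype combination directly, assuming infer_test.py picks these settings up from the environment as test.sh's exports suggest, and that a tritonserver instance and the QA model repository are already set up the way test.sh normally arranges them (backend list and test name here are illustrative only):

    # Illustrative direct invocation; all values are examples, not CI defaults.
    cd qa/L0_infer
    export BACKENDS="onnx libtorch"   # subset of the backends test.sh assembles
    export ENSEMBLES=0                # skip ensemble prefixes
    export NOBATCH=0                  # skip the *_nobatch models
    export BATCH=1                    # keep the batched models
    python3 infer_test.py InferTest.test_raw_fff
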
diff --git a/qa/L0_infer/install_and_test.sh b/qa/L0_infer/install_and_test.sh
index f488f510f4..28e5dad52e 100755
--- a/qa/L0_infer/install_and_test.sh
+++ b/qa/L0_infer/install_and_test.sh
@@ -25,7 +25,7 @@
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
-# Note: This script is to be used with customized triton containers that need 
+# Note: This script is to be used with customized triton containers that need
 # dependencies to run L0_infer tests
 apt-get update && \
     apt-get install -y --no-install-recommends \
diff --git a/qa/L0_infer/test.sh b/qa/L0_infer/test.sh
index c48d4f8f64..34a669f874 100755
--- a/qa/L0_infer/test.sh
+++ b/qa/L0_infer/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -38,12 +38,14 @@ if [ ! -z "$TEST_REPO_ARCH" ]; then
     REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
 fi
 
+ldconfig || true
+
 export CUDA_VISIBLE_DEVICES=0
 
 TEST_RESULT_FILE='test_results.txt'
 CLIENT_LOG_BASE="./client"
 INFER_TEST=infer_test.py
-SERVER_TIMEOUT=360
+SERVER_TIMEOUT=${SERVER_TIMEOUT:=600}
 
 if [ -z "$TEST_SYSTEM_SHARED_MEMORY" ]; then
     TEST_SYSTEM_SHARED_MEMORY="0"
@@ -61,22 +63,25 @@ if [ "$TEST_VALGRIND" -eq 1 ]; then
     LEAKCHECK_LOG_BASE="./valgrind_test"
     LEAKCHECK=/usr/bin/valgrind
     LEAKCHECK_ARGS_BASE="--leak-check=full --show-leak-kinds=definite --max-threads=3000 --num-callers=20"
-    SERVER_TIMEOUT=3600
+    SERVER_TIMEOUT=4000
     rm -f $LEAKCHECK_LOG_BASE*
+    # Remove 'python', 'python_dlpack' and 'onnx' from BACKENDS and test them
+    # separately below.
+    BACKENDS="graphdef savedmodel libtorch plan openvino"
 fi
 
 if [ "$TEST_SYSTEM_SHARED_MEMORY" -eq 1 ] || [ "$TEST_CUDA_SHARED_MEMORY" -eq 1 ]; then
-  EXPECTED_NUM_TESTS=${EXPECTED_NUM_TESTS:="29"}
+    EXPECTED_NUM_TESTS=${EXPECTED_NUM_TESTS:="33"}
 else
-  EXPECTED_NUM_TESTS=${EXPECTED_NUM_TESTS:="42"}
+    EXPECTED_NUM_TESTS=${EXPECTED_NUM_TESTS:="46"}
 fi
 
-TF_VERSION=${TF_VERSION:=1}
+TF_VERSION=${TF_VERSION:=2}
 TEST_JETSON=${TEST_JETSON:=0}
 
 # Default size (in MB) of shared memory to be used by each python model
-# instance (Default is 64MB)
-DEFAULT_SHM_SIZE_MB=${DEFAULT_SHM_SIZE_MB:=64}
+# instance (Default is 1MB)
+DEFAULT_SHM_SIZE_MB=${DEFAULT_SHM_SIZE_MB:=1}
 DEFAULT_SHM_SIZE_BYTES=$((1024*1024*$DEFAULT_SHM_SIZE_MB))
 
 # On windows the paths invoked by the script (running in WSL) must use
@@ -93,6 +98,14 @@ else
     TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"}
     SERVER=${TRITON_DIR}/bin/tritonserver
     BACKEND_DIR=${TRITON_DIR}/backends
+
+    # PyTorch on SBSA requires libgomp to be loaded first. See the following
+    # GitHub issue for more information:
+    # https://github.com/pytorch/pytorch/issues/2575
+    arch=`uname -m`
+    if [ $arch = "aarch64" ]; then
+      SERVER_LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libgomp.so.1
+    fi
 fi
 
 # Allow more time to exit. Ensemble brings in too many models
@@ -123,31 +136,21 @@ export BACKENDS
 ENSEMBLES=${ENSEMBLES:="1"}
 export ENSEMBLES
 
+# Test for both batch and nobatch models
+NOBATCH=${NOBATCH:="1"}
+export NOBATCH
+BATCH=${BATCH:="1"}
+export BATCH
+
 if [[ $BACKENDS == *"python_dlpack"* ]]; then
-    if [ "$TEST_JETSON" == "0" ]; then
-        if [[ "aarch64" != $(uname -m) ]] ; then
-            pip3 install torch==1.9.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
-        else
-            pip3 install torch==1.9.0 -f https://download.pytorch.org/whl/torch_stable.html
-        fi
+    if [[ "aarch64" != $(uname -m) ]] ; then
+        pip3 install torch==1.13.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+    else
+        pip3 install torch==1.13.0 -f https://download.pytorch.org/whl/torch_stable.html
     fi
 fi
 
-
-for TARGET in cpu gpu; do
-    if [ "$TRITON_SERVER_CPU_ONLY" == "1" ]; then
-        if [ "$TARGET" == "gpu" ]; then
-            echo -e "Skip GPU testing on CPU-only device"
-            continue
-        fi
-        # set strict readiness=false on CPU-only device to allow
-        # unsuccessful load of TensorRT plans, which require GPU.
-        SERVER_ARGS="--model-repository=${MODELDIR} --strict-readiness=false --exit-on-error=false ${SERVER_ARGS_EXTRA}"
-    fi
-
-    SERVER_LOG=$SERVER_LOG_BASE.${TARGET}.log
-    CLIENT_LOG=$CLIENT_LOG_BASE.${TARGET}.log
-
+function generate_model_repository() {
     rm -fr models && mkdir models
     for BACKEND in $BACKENDS; do
       if [ "$BACKEND" == "python" ] || [ "$BACKEND" == "python_dlpack" ]; then
@@ -204,6 +207,9 @@ for TARGET in cpu gpu; do
             fi
           fi
         done
+      elif [ "$BACKEND" == "plan" ] && [ "$TRITON_SERVER_CPU_ONLY" == "1" ]; then
+        # skip plan_tensorrt models since they don't run on CPU-only containers
+        continue
       else
         cp -r ${DATADIR}/qa_model_repository/${BACKEND}* \
           models/.
@@ -214,7 +220,10 @@ for TARGET in cpu gpu; do
 
       # Copy identity backend models and ensembles
       for BACKEND in $BACKENDS; do
-        if [ "$BACKEND" != "python" ] && [ "$BACKEND" != "python_dlpack" ] && [ "$BACKEND" != "openvino" ]; then
+        if [ "$BACKEND" == "plan" ] && [ "$TRITON_SERVER_CPU_ONLY" == "1" ]; then
+            # skip plan_tensorrt models since they don't run on CPU-only containers
+            continue
+        elif [ "$BACKEND" != "python" ] && [ "$BACKEND" != "python_dlpack" ] && [ "$BACKEND" != "openvino" ]; then
             cp -r ${DATADIR}/qa_ensemble_model_repository/qa_model_repository/*${BACKEND}* \
               models/.
         fi
@@ -242,7 +251,12 @@ for TARGET in cpu gpu; do
 
     KIND="KIND_GPU" && [[ "$TARGET" == "cpu" ]] && KIND="KIND_CPU"
     for FW in $BACKENDS; do
-      if [ "$FW" != "plan" ] && [ "$FW" != "python" ] && [ "$FW" != "python_dlpack" ] && [ "$FW" != "openvino" ];then
+      if [ "$FW" == "onnx" ] && [ "$TEST_VALGRIND" -eq 1 ]; then
+        # Reduce the instance count to make loading onnx models faster
+        for MC in `ls models/${FW}*/config.pbtxt`; do
+            echo "instance_group [ { kind: ${KIND} count: 1 }]" >> $MC
+        done
+      elif [ "$FW" != "plan" ] && [ "$FW" != "python" ] && [ "$FW" != "python_dlpack" ] && [ "$FW" != "openvino" ];then
         for MC in `ls models/${FW}*/config.pbtxt`; do
             echo "instance_group [ { kind: ${KIND} }]" >> $MC
         done
@@ -269,6 +283,21 @@ for TARGET in cpu gpu; do
             sed -i "s/max_batch_size: 1/max_batch_size: 0/" config.pbtxt && \
             sed -i "s/dims: \[ 1 \]/dims: \[ -1, -1 \]/" config.pbtxt)
 
+}
+
+for TARGET in cpu gpu; do
+    if [ "$TRITON_SERVER_CPU_ONLY" == "1" ]; then
+        if [ "$TARGET" == "gpu" ]; then
+            echo -e "Skip GPU testing on CPU-only device"
+            continue
+        fi
+    fi
+
+    SERVER_LOG=$SERVER_LOG_BASE.${TARGET}.log
+    CLIENT_LOG=$CLIENT_LOG_BASE.${TARGET}.log
+
+    generate_model_repository
+
     # Check if running a memory leak check
     if [ "$TEST_VALGRIND" -eq 1 ]; then
         LEAKCHECK_LOG=$LEAKCHECK_LOG_BASE.${TARGET}.log
@@ -299,7 +328,6 @@ for TARGET in cpu gpu; do
         fi
     fi
 
-
     set -e
 
     kill_server
@@ -314,6 +342,96 @@ for TARGET in cpu gpu; do
     set -e
 done
 
+# Run the 'python', 'python_dlpack' and 'onnx' models separately in the valgrind
+# test. Loading the python and python_dlpack models hits OOM issues under
+# valgrind, so only the batch or the nobatch models are loaded at a time.
+# Loading all the onnx models at once requires more than 12 hours, so they are
+# loaded separately to reduce the loading time.
+if [ "$TEST_VALGRIND" -eq 1 ]; then
+  TESTING_BACKENDS="python python_dlpack onnx"
+  EXPECTED_NUM_TESTS=42
+  if [[ "aarch64" != $(uname -m) ]] ; then
+      pip3 install torch==1.13.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+  else
+      pip3 install torch==1.13.0 -f https://download.pytorch.org/whl/torch_stable.html
+  fi
+
+  for BACKENDS in $TESTING_BACKENDS; do
+    export BACKENDS
+    for TARGET in cpu gpu; do
+      rm -fr *models
+      generate_model_repository
+      mkdir nobatch_models
+      mv ./models/*nobatch_* ./nobatch_models/.
+      cp -fr ./models/nop_* ./nobatch_models/.
+
+      for BATCHING_MODE in batch nobatch; do
+        if [ "$TRITON_SERVER_CPU_ONLY" == "1" ]; then
+          if [ "$TARGET" == "gpu" ]; then
+              echo -e "Skip GPU testing on CPU-only device"
+              continue
+          fi
+        fi
+
+        SERVER_LOG=$SERVER_LOG_BASE.${TARGET}.${BACKENDS}.${BATCHING_MODE}.log
+        CLIENT_LOG=$CLIENT_LOG_BASE.${TARGET}.${BACKENDS}.${BATCHING_MODE}.log
+
+        if [ "$BATCHING_MODE" == "batch" ]; then
+          NOBATCH="0"
+          export NOBATCH
+          BATCH="1"
+          export BATCH
+          MODELDIR=`pwd`/models
+        else
+          NOBATCH="1"
+          export NOBATCH
+          BATCH="0"
+          export BATCH
+          MODELDIR=`pwd`/nobatch_models
+        fi
+
+        SERVER_ARGS="--model-repository=${MODELDIR} ${SERVER_ARGS_EXTRA}"
+        LEAKCHECK_LOG=$LEAKCHECK_LOG_BASE.${TARGET}.${BACKENDS}.${BATCHING_MODE}.log
+        LEAKCHECK_ARGS="$LEAKCHECK_ARGS_BASE --log-file=$LEAKCHECK_LOG"
+        run_server_leakcheck
+
+        if [ "$SERVER_PID" == "0" ]; then
+            echo -e "\n***\n*** Failed to start $SERVER\n***"
+            cat $SERVER_LOG
+            exit 1
+        fi
+
+        set +e
+
+        python3 $INFER_TEST >$CLIENT_LOG 2>&1
+        if [ $? -ne 0 ]; then
+            cat $CLIENT_LOG
+            RET=1
+        else
+            check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS
+            if [ $? -ne 0 ]; then
+                cat $CLIENT_LOG
+                cat $TEST_RESULT_FILE
+                echo -e "\n***\n*** Test Result Verification Failed\n***"
+                RET=1
+            fi
+        fi
+
+        set -e
+
+        kill_server
+
+        set +e
+        python3 ../common/check_valgrind_log.py -f $LEAKCHECK_LOG
+        if [ $? -ne 0 ]; then
+            RET=1
+        fi
+        set -e
+      done
+    done
+  done
+fi
+
 if [ $RET -eq 0 ]; then
   echo -e "\n***\n*** Test Passed\n***"
 else
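
The valgrind-only block above splits the generated repository into batch and nobatch subsets purely by the '_nobatch' naming convention ('mv ./models/*nobatch_* ./nobatch_models/.'). A minimal Python sketch of that partitioning step, assuming the same naming convention and the 'models/' and 'nobatch_models/' directories used above (illustration only, not part of the patch):

    import shutil
    from pathlib import Path

    models = Path("models")
    nobatch_models = Path("nobatch_models")
    nobatch_models.mkdir(exist_ok=True)

    # Move every *nobatch_* model directory into the nobatch repository, then
    # copy the nop_* helper models so both repositories remain loadable.
    for model_dir in models.glob("*nobatch_*"):
        shutil.move(str(model_dir), str(nobatch_models / model_dir.name))
    for nop_dir in models.glob("nop_*"):
        shutil.copytree(str(nop_dir), str(nobatch_models / nop_dir.name), dirs_exist_ok=True)
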
diff --git a/qa/L0_infer_reshape/infer_reshape_test.py b/qa/L0_infer_reshape/infer_reshape_test.py
old mode 100644
new mode 100755
index 0c8156c98f..e77dcbecaf
--- a/qa/L0_infer_reshape/infer_reshape_test.py
+++ b/qa/L0_infer_reshape/infer_reshape_test.py
@@ -1,4 +1,6 @@
-# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,169 +27,216 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-from builtins import range
-from future.utils import iteritems
+import os
 import unittest
-import numpy as np
+
 import infer_util as iu
+import numpy as np
 import test_util as tu
-import os
 
 np_dtype_string = np.dtype(object)
 
-TEST_SYSTEM_SHARED_MEMORY = bool(
-    int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0)))
-TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get('TEST_CUDA_SHARED_MEMORY',
-                                                  0)))
+TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0)))
+TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0)))
 
 
 class InferReshapeTest(tu.TestResultCollector):
-
-    def _full_reshape(self,
-                      dtype,
-                      input_shapes,
-                      output_shapes=None,
-                      no_batch=True):
+    def _full_reshape(self, dtype, input_shapes, output_shapes=None, no_batch=True):
         # 'shapes' is list of shapes, one for each input.
         if output_shapes is None:
             output_shapes = input_shapes
 
         # For validation assume any shape can be used...
-        if tu.validate_for_tf_model(dtype, dtype, dtype, input_shapes[0],
-                                    input_shapes[0], input_shapes[0]):
+        if tu.validate_for_tf_model(
+            dtype, dtype, dtype, input_shapes[0], input_shapes[0], input_shapes[0]
+        ):
             # model that supports batching
             for bs in (1, 8):
-                full_shapes = [[
-                    bs,
-                ] + input_shape for input_shape in input_shapes]
-                full_output_shapes = [[
-                    bs,
-                ] + output_shape for output_shape in output_shapes]
+                full_shapes = [
+                    [
+                        bs,
+                    ]
+                    + input_shape
+                    for input_shape in input_shapes
+                ]
+                full_output_shapes = [
+                    [
+                        bs,
+                    ]
+                    + output_shape
+                    for output_shape in output_shapes
+                ]
                 iu.infer_zero(
                     self,
-                    'graphdef',
+                    "graphdef",
                     bs,
                     dtype,
                     full_shapes,
                     full_output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
                 iu.infer_zero(
                     self,
-                    'savedmodel',
+                    "savedmodel",
                     bs,
                     dtype,
                     full_shapes,
                     full_output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             # model that does not support batching
             if no_batch:
                 iu.infer_zero(
                     self,
-                    'graphdef_nobatch',
+                    "graphdef_nobatch",
                     1,
                     dtype,
                     input_shapes,
                     output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
                 iu.infer_zero(
                     self,
-                    'savedmodel_nobatch',
+                    "savedmodel_nobatch",
                     1,
                     dtype,
                     input_shapes,
                     output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
 
-        if tu.validate_for_onnx_model(dtype, dtype, dtype, input_shapes[0],
-                                      input_shapes[0], input_shapes[0]):
+        if tu.validate_for_onnx_model(
+            dtype, dtype, dtype, input_shapes[0], input_shapes[0], input_shapes[0]
+        ):
             # model that supports batching
             for bs in (1, 8):
-                full_shapes = [[
-                    bs,
-                ] + input_shape for input_shape in input_shapes]
-                full_output_shapes = [[
-                    bs,
-                ] + output_shape for output_shape in output_shapes]
+                full_shapes = [
+                    [
+                        bs,
+                    ]
+                    + input_shape
+                    for input_shape in input_shapes
+                ]
+                full_output_shapes = [
+                    [
+                        bs,
+                    ]
+                    + output_shape
+                    for output_shape in output_shapes
+                ]
                 iu.infer_zero(
                     self,
-                    'onnx',
+                    "onnx",
                     bs,
                     dtype,
                     full_shapes,
                     full_output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             # model that does not support batching
             if no_batch:
                 iu.infer_zero(
                     self,
-                    'onnx_nobatch',
+                    "onnx_nobatch",
                     1,
                     dtype,
                     input_shapes,
                     output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
 
-        # Skip for libtorch string I/O
-        if tu.validate_for_libtorch_model(dtype, dtype, dtype, input_shapes[0],
-                                          input_shapes[0], input_shapes[0]) and \
-                                              (dtype != np_dtype_string):
+        if tu.validate_for_libtorch_model(
+            dtype,
+            dtype,
+            dtype,
+            input_shapes[0],
+            input_shapes[0],
+            input_shapes[0],
+            reshape=True,
+        ):
             # skip variable size reshape on libtorch for now,
             # see "gen_qa_reshape_model.py" for detail
             if dtype != np.int32:
                 # model that does not support batching
-                if no_batch:
+                # skip for libtorch string I/O
+                if no_batch and (dtype != np_dtype_string):
                     iu.infer_zero(
                         self,
-                        'libtorch_nobatch',
+                        "libtorch_nobatch",
                         1,
                         dtype,
                         input_shapes,
                         output_shapes,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
                 # model that supports batching
                 for bs in (1, 8):
-                    full_shapes = [[
-                        bs,
-                    ] + input_shape for input_shape in input_shapes]
-                    full_output_shapes = [[
-                        bs,
-                    ] + output_shape for output_shape in output_shapes]
+                    full_shapes = [
+                        [
+                            bs,
+                        ]
+                        + input_shape
+                        for input_shape in input_shapes
+                    ]
+                    full_output_shapes = [
+                        [
+                            bs,
+                        ]
+                        + output_shape
+                        for output_shape in output_shapes
+                    ]
                     iu.infer_zero(
                         self,
-                        'libtorch',
+                        "libtorch",
                         bs,
                         dtype,
                         full_shapes,
                         full_output_shapes,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
         for name in ["simple_reshape", "sequence_reshape", "fan_reshape"]:
             # [TODO] Skip variable size reshape on ensemble for now.
             # Need rework on how ensemble for reshape are generated
             if dtype == np.int32:
                 break
-            if tu.validate_for_ensemble_model(name, dtype, dtype, dtype,
-                                              input_shapes[0], input_shapes[0],
-                                              input_shapes[0]):
+            if tu.validate_for_ensemble_model(
+                name,
+                dtype,
+                dtype,
+                dtype,
+                input_shapes[0],
+                input_shapes[0],
+                input_shapes[0],
+            ):
                 # model that supports batching
                 for bs in (1, 8):
-                    full_shapes = [[
-                        bs,
-                    ] + input_shape for input_shape in input_shapes]
-                    full_output_shapes = [[
-                        bs,
-                    ] + output_shape for output_shape in output_shapes]
+                    full_shapes = [
+                        [
+                            bs,
+                        ]
+                        + input_shape
+                        for input_shape in input_shapes
+                    ]
+                    full_output_shapes = [
+                        [
+                            bs,
+                        ]
+                        + output_shape
+                        for output_shape in output_shapes
+                    ]
                     iu.infer_zero(
                         self,
                         name,
@@ -196,58 +245,67 @@ def _full_reshape(self,
                         full_shapes,
                         full_output_shapes,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
                 # model that does not support batching
                 if no_batch:
                     iu.infer_zero(
                         self,
-                        name + '_nobatch',
+                        name + "_nobatch",
                         1,
                         dtype,
                         input_shapes,
                         output_shapes,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
-    def _trt_reshape(self,
-                     dtype,
-                     input_shapes,
-                     output_shapes=None,
-                     no_batch=True):
+    def _trt_reshape(self, dtype, input_shapes, output_shapes=None, no_batch=True):
         # 'shapes' is list of shapes, one for each input.
         if output_shapes is None:
             output_shapes = input_shapes
 
-        if tu.validate_for_trt_model(dtype, dtype, dtype, input_shapes[0],
-                                     input_shapes[0], input_shapes[0]):
+        if tu.validate_for_trt_model(
+            dtype, dtype, dtype, input_shapes[0], input_shapes[0], input_shapes[0]
+        ):
             # model that supports batching
             for bs in (1, 8):
-                full_shapes = [[
-                    bs,
-                ] + input_shape for input_shape in input_shapes]
-                full_output_shapes = [[
-                    bs,
-                ] + output_shape for output_shape in output_shapes]
+                full_shapes = [
+                    [
+                        bs,
+                    ]
+                    + input_shape
+                    for input_shape in input_shapes
+                ]
+                full_output_shapes = [
+                    [
+                        bs,
+                    ]
+                    + output_shape
+                    for output_shape in output_shapes
+                ]
                 iu.infer_zero(
                     self,
-                    'plan',
+                    "plan",
                     bs,
                     dtype,
                     full_shapes,
                     full_output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             # model that does not support batching
             if no_batch:
                 iu.infer_zero(
                     self,
-                    'plan_nobatch',
+                    "plan_nobatch",
                     1,
                     dtype,
                     input_shapes,
                     output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
 
     def test_ff1(self):
         self._full_reshape(np.float32, input_shapes=([1],), no_batch=False)
@@ -260,21 +318,24 @@ def test_ff3(self):
         self._full_reshape(np.float32, input_shapes=([4, 4], [2], [2, 2, 3]))
 
     def test_ff4(self):
-        self._full_reshape(np.float32,
-                           input_shapes=([4, 4], [2], [2, 2, 3], [1]),
-                           output_shapes=([16], [1, 2], [3, 2, 2], [1]))
-        self._trt_reshape(np.float32,
-                          input_shapes=([4, 4], [2], [2, 2, 3], [1]),
-                          output_shapes=([2, 2, 4], [1, 2, 1], [3, 2,
-                                                                2], [1, 1, 1]))
+        self._full_reshape(
+            np.float32,
+            input_shapes=([4, 4], [2], [2, 2, 3], [1]),
+            output_shapes=([16], [1, 2], [3, 2, 2], [1]),
+        )
+        self._trt_reshape(
+            np.float32,
+            input_shapes=([4, 4], [2], [2, 2, 3], [1]),
+            output_shapes=([2, 2, 4], [1, 2, 1], [3, 2, 2], [1, 1, 1]),
+        )
 
     def test_ii1(self):
         self._full_reshape(np.int32, input_shapes=([2, 4, 5, 6],))
 
     def test_ii2(self):
-        self._full_reshape(np.int32,
-                           input_shapes=([4, 1], [2]),
-                           output_shapes=([1, 4], [1, 2]))
+        self._full_reshape(
+            np.int32, input_shapes=([4, 1], [2]), output_shapes=([1, 4], [1, 2])
+        )
 
     def test_ii3(self):
         self._full_reshape(np.int32, input_shapes=([1, 4, 1], [8], [2, 2, 3]))
@@ -283,5 +344,5 @@ def test_oo1(self):
         self._full_reshape(np.object_, input_shapes=([1],), no_batch=False)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
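
The reformatted _full_reshape and _trt_reshape above repeatedly prepend the batch dimension to each per-input shape before calling iu.infer_zero. A standalone sketch of that pattern, with arbitrary example shapes (illustration only, not part of the patch):

    input_shapes = [[4, 4], [2], [2, 2, 3]]

    for bs in (1, 8):
        # Prepend the batch dimension to every per-input shape.
        full_shapes = [[bs] + shape for shape in input_shapes]
        print(bs, full_shapes)
        # bs=8 -> [[8, 4, 4], [8, 2], [8, 2, 2, 3]]
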
diff --git a/qa/L0_infer_reshape/test.sh b/qa/L0_infer_reshape/test.sh
index 325e24930d..218be954d9 100755
--- a/qa/L0_infer_reshape/test.sh
+++ b/qa/L0_infer_reshape/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2019-2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
diff --git a/qa/L0_infer_variable/infer_variable_test.py b/qa/L0_infer_variable/infer_variable_test.py
old mode 100644
new mode 100755
index 95e31c6962..e5e6470a3c
--- a/qa/L0_infer_variable/infer_variable_test.py
+++ b/qa/L0_infer_variable/infer_variable_test.py
@@ -1,4 +1,6 @@
-# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,52 +27,54 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
 import os
 import unittest
-import numpy as np
+
 import infer_util as iu
+import numpy as np
 import test_util as tu
 
 np_dtype_string = np.dtype(object)
 
-TEST_SYSTEM_SHARED_MEMORY = bool(
-    int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0)))
-TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get('TEST_CUDA_SHARED_MEMORY',
-                                                  0)))
+TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0)))
+TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0)))
 
 
 class InferVariableTest(tu.TestResultCollector):
-
-    def _full_exact(self,
-                    input_dtype,
-                    output0_dtype,
-                    output1_dtype,
-                    input_shape,
-                    output0_shape,
-                    output1_shape,
-                    output0_raw=True,
-                    output1_raw=True,
-                    swap=False):
-
-        def _infer_exact_helper(tester,
-                                pf,
-                                tensor_shape,
-                                batch_size,
-                                input_dtype,
-                                output0_dtype,
-                                output1_dtype,
-                                output0_raw=True,
-                                output1_raw=True,
-                                model_version=None,
-                                swap=False,
-                                outputs=("OUTPUT0", "OUTPUT1"),
-                                use_http=True,
-                                use_grpc=True,
-                                skip_request_id_check=False,
-                                use_streaming=True,
-                                correlation_id=0):
+    def _full_exact(
+        self,
+        input_dtype,
+        output0_dtype,
+        output1_dtype,
+        input_shape,
+        output0_shape,
+        output1_shape,
+        output0_raw=True,
+        output1_raw=True,
+        swap=False,
+    ):
+        def _infer_exact_helper(
+            tester,
+            pf,
+            tensor_shape,
+            batch_size,
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            output0_raw=True,
+            output1_raw=True,
+            model_version=None,
+            swap=False,
+            outputs=("OUTPUT0", "OUTPUT1"),
+            use_http=True,
+            use_grpc=True,
+            skip_request_id_check=False,
+            use_streaming=True,
+            correlation_id=0,
+        ):
             for bs in (1, batch_size):
                 # model that does not support batching
                 if bs == 1:
@@ -93,15 +97,23 @@ def _infer_exact_helper(tester,
                         use_streaming=use_streaming,
                         correlation_id=correlation_id,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
                 # model that supports batching. Skip for libtorch string I/O
-                elif pf == 'libtorch' and tu.validate_for_libtorch_model(
-                        input_dtype, output0_dtype, output1_dtype, tensor_shape,
-                        tensor_shape, tensor_shape, bs):
+                elif pf == "libtorch" and tu.validate_for_libtorch_model(
+                    input_dtype,
+                    output0_dtype,
+                    output1_dtype,
+                    tensor_shape,
+                    tensor_shape,
+                    tensor_shape,
+                    bs,
+                ):
                     iu.infer_exact(
                         tester,
-                        pf, (bs,) + tensor_shape,
+                        pf,
+                        (bs,) + tensor_shape,
                         bs,
                         input_dtype,
                         output0_dtype,
@@ -117,91 +129,128 @@ def _infer_exact_helper(tester,
                         use_streaming=use_streaming,
                         correlation_id=correlation_id,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
         all_ensemble_prefix = ["simple_", "sequence_", "fan_"]
         ensemble_prefix = [""]
         for prefix in all_ensemble_prefix:
-            if tu.validate_for_ensemble_model(prefix, input_dtype,
-                                              output0_dtype, output1_dtype,
-                                              input_shape, input_shape,
-                                              input_shape):
+            if tu.validate_for_ensemble_model(
+                prefix,
+                input_dtype,
+                output0_dtype,
+                output1_dtype,
+                input_shape,
+                input_shape,
+                input_shape,
+            ):
                 ensemble_prefix.append(prefix)
 
-        if tu.validate_for_tf_model(input_dtype, output0_dtype, output1_dtype,
-                                    input_shape, output0_shape, output1_shape):
+        if tu.validate_for_tf_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            input_shape,
+            output0_shape,
+            output1_shape,
+        ):
             for prefix in ensemble_prefix:
                 for pf in ["graphdef", "savedmodel"]:
-                    _infer_exact_helper(self,
-                                        prefix + pf,
-                                        input_shape,
-                                        8,
-                                        input_dtype,
-                                        output0_dtype,
-                                        output1_dtype,
-                                        output0_raw=output0_raw,
-                                        output1_raw=output1_raw,
-                                        swap=swap)
-
-        if tu.validate_for_trt_model(input_dtype, output0_dtype, output1_dtype,
-                                     input_shape, output0_shape, output1_shape):
+                    _infer_exact_helper(
+                        self,
+                        prefix + pf,
+                        input_shape,
+                        8,
+                        input_dtype,
+                        output0_dtype,
+                        output1_dtype,
+                        output0_raw=output0_raw,
+                        output1_raw=output1_raw,
+                        swap=swap,
+                    )
+
+        if tu.validate_for_trt_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            input_shape,
+            output0_shape,
+            output1_shape,
+        ):
             for prefix in ensemble_prefix:
                 if input_dtype == np.int8:
-                    _infer_exact_helper(self,
-                                        prefix + 'plan',
-                                        input_shape + (1, 1),
-                                        8,
-                                        input_dtype,
-                                        output0_dtype,
-                                        output1_dtype,
-                                        output0_raw=output0_raw,
-                                        output1_raw=output1_raw,
-                                        swap=swap)
+                    _infer_exact_helper(
+                        self,
+                        prefix + "plan",
+                        input_shape + (1, 1),
+                        8,
+                        input_dtype,
+                        output0_dtype,
+                        output1_dtype,
+                        output0_raw=output0_raw,
+                        output1_raw=output1_raw,
+                        swap=swap,
+                    )
                 else:
-                    _infer_exact_helper(self,
-                                        prefix + 'plan',
-                                        input_shape,
-                                        8,
-                                        input_dtype,
-                                        output0_dtype,
-                                        output1_dtype,
-                                        output0_raw=output0_raw,
-                                        output1_raw=output1_raw,
-                                        swap=swap)
-
-        if tu.validate_for_onnx_model(input_dtype, output0_dtype, output1_dtype,
-                                      input_shape, output0_shape,
-                                      output1_shape):
+                    _infer_exact_helper(
+                        self,
+                        prefix + "plan",
+                        input_shape,
+                        8,
+                        input_dtype,
+                        output0_dtype,
+                        output1_dtype,
+                        output0_raw=output0_raw,
+                        output1_raw=output1_raw,
+                        swap=swap,
+                    )
+
+        if tu.validate_for_onnx_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            input_shape,
+            output0_shape,
+            output1_shape,
+        ):
             # No basic ensemble models are created against custom models [TODO]
-            _infer_exact_helper(self,
-                                'onnx',
-                                input_shape,
-                                8,
-                                input_dtype,
-                                output0_dtype,
-                                output1_dtype,
-                                output0_raw=output0_raw,
-                                output1_raw=output1_raw,
-                                swap=swap)
-
-        if tu.validate_for_libtorch_model(input_dtype, output0_dtype,
-                                          output1_dtype, input_shape,
-                                          output0_shape, output1_shape):
+            _infer_exact_helper(
+                self,
+                "onnx",
+                input_shape,
+                8,
+                input_dtype,
+                output0_dtype,
+                output1_dtype,
+                output0_raw=output0_raw,
+                output1_raw=output1_raw,
+                swap=swap,
+            )
+
+        if tu.validate_for_libtorch_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            input_shape,
+            output0_shape,
+            output1_shape,
+        ):
             # No basic ensemble models are created against custom models [TODO]
-            _infer_exact_helper(self,
-                                'libtorch',
-                                input_shape,
-                                8,
-                                input_dtype,
-                                output0_dtype,
-                                output1_dtype,
-                                output0_raw=output0_raw,
-                                output1_raw=output1_raw,
-                                swap=swap)
+            _infer_exact_helper(
+                self,
+                "libtorch",
+                input_shape,
+                8,
+                input_dtype,
+                output0_dtype,
+                output1_dtype,
+                output0_raw=output0_raw,
+                output1_raw=output1_raw,
+                swap=swap,
+            )
 
     def test_raw_fff(self):
-        self._full_exact(np.float32, np.float32, np.float32, (16,), (16,),
-                         (16,))
+        self._full_exact(np.float32, np.float32, np.float32, (16,), (16,), (16,))
 
     def test_raw_fii(self):
         self._full_exact(np.float32, np.int32, np.int32, (2, 8), (2, 8), (2, 8))
@@ -210,8 +259,9 @@ def test_raw_fll(self):
         self._full_exact(np.float32, np.int64, np.int64, (8, 4), (8, 4), (8, 4))
 
     def test_raw_fil(self):
-        self._full_exact(np.float32, np.int32, np.int64, (2, 8, 2), (2, 8, 2),
-                         (2, 8, 2))
+        self._full_exact(
+            np.float32, np.int32, np.int64, (2, 8, 2), (2, 8, 2), (2, 8, 2)
+        )
 
     def test_raw_ffi(self):
         self._full_exact(np.float32, np.float32, np.int32, (16,), (16,), (16,))
@@ -220,95 +270,148 @@ def test_raw_iii(self):
         self._full_exact(np.int32, np.int32, np.int32, (2, 8), (2, 8), (2, 8))
 
     def test_faw_iif(self):
-        self._full_exact(np.int32, np.int32, np.float32, (2, 8, 2), (2, 8, 2),
-                         (2, 8, 2))
+        self._full_exact(
+            np.int32, np.int32, np.float32, (2, 8, 2), (2, 8, 2), (2, 8, 2)
+        )
 
     def test_raw_ooo(self):
-        self._full_exact(np_dtype_string, np_dtype_string, np_dtype_string,
-                         (16,), (16,), (16,))
+        self._full_exact(
+            np_dtype_string, np_dtype_string, np_dtype_string, (16,), (16,), (16,)
+        )
 
     def test_raw_oii(self):
-        self._full_exact(np_dtype_string, np.int32, np.int32, (2, 8), (2, 8),
-                         (2, 8))
+        self._full_exact(np_dtype_string, np.int32, np.int32, (2, 8), (2, 8), (2, 8))
 
     def test_raw_ooi(self):
-        self._full_exact(np_dtype_string, np_dtype_string, np.int32, (8, 4),
-                         (8, 4), (8, 4))
+        self._full_exact(
+            np_dtype_string, np_dtype_string, np.int32, (8, 4), (8, 4), (8, 4)
+        )
 
     def test_raw_oio(self):
-        self._full_exact(np_dtype_string, np.int32, np_dtype_string, (2, 8, 2),
-                         (2, 8, 2), (2, 8, 2))
+        self._full_exact(
+            np_dtype_string, np.int32, np_dtype_string, (2, 8, 2), (2, 8, 2), (2, 8, 2)
+        )
 
     def test_class_fff(self):
-        self._full_exact(np.float32,
-                         np.float32,
-                         np.float32, (16,), (16,), (16,),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.float32,
+            np.float32,
+            np.float32,
+            (16,),
+            (16,),
+            (16,),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_class_fii(self):
-        self._full_exact(np.float32,
-                         np.int32,
-                         np.int32, (2, 8), (2, 8), (2, 8),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.float32,
+            np.int32,
+            np.int32,
+            (2, 8),
+            (2, 8),
+            (2, 8),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_class_fll(self):
-        self._full_exact(np.float32,
-                         np.int64,
-                         np.int64, (8, 4), (8, 4), (8, 4),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.float32,
+            np.int64,
+            np.int64,
+            (8, 4),
+            (8, 4),
+            (8, 4),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_class_fil(self):
-        self._full_exact(np.float32,
-                         np.int32,
-                         np.int64, (2, 8, 2), (2, 8, 2), (2, 8, 2),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.float32,
+            np.int32,
+            np.int64,
+            (2, 8, 2),
+            (2, 8, 2),
+            (2, 8, 2),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_class_ffi(self):
-        self._full_exact(np.float32,
-                         np.float32,
-                         np.int32, (16,), (16,), (16,),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.float32,
+            np.float32,
+            np.int32,
+            (16,),
+            (16,),
+            (16,),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_class_iii(self):
-        self._full_exact(np.int32,
-                         np.int32,
-                         np.int32, (2, 8), (2, 8), (2, 8),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.int32,
+            np.int32,
+            np.int32,
+            (2, 8),
+            (2, 8),
+            (2, 8),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_class_iif(self):
-        self._full_exact(np.int32,
-                         np.int32,
-                         np.float32, (2, 8, 2), (2, 8, 2), (2, 8, 2),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.int32,
+            np.int32,
+            np.float32,
+            (2, 8, 2),
+            (2, 8, 2),
+            (2, 8, 2),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_mix_ffi(self):
-        self._full_exact(np.float32,
-                         np.float32,
-                         np.int32, (16,), (16,), (16,),
-                         output0_raw=True,
-                         output1_raw=False)
+        self._full_exact(
+            np.float32,
+            np.float32,
+            np.int32,
+            (16,),
+            (16,),
+            (16,),
+            output0_raw=True,
+            output1_raw=False,
+        )
 
     def test_mix_iii(self):
-        self._full_exact(np.int32,
-                         np.int32,
-                         np.int32, (2, 8), (2, 8), (2, 8),
-                         output0_raw=False,
-                         output1_raw=True)
+        self._full_exact(
+            np.int32,
+            np.int32,
+            np.int32,
+            (2, 8),
+            (2, 8),
+            (2, 8),
+            output0_raw=False,
+            output1_raw=True,
+        )
 
     def test_mix_iif(self):
-        self._full_exact(np.int32,
-                         np.int32,
-                         np.float32, (2, 8, 2), (2, 8, 2), (2, 8, 2),
-                         output0_raw=True,
-                         output1_raw=False)
-
-
-if __name__ == '__main__':
+        self._full_exact(
+            np.int32,
+            np.int32,
+            np.float32,
+            (2, 8, 2),
+            (2, 8, 2),
+            (2, 8, 2),
+            output0_raw=True,
+            output1_raw=False,
+        )
+
+
+if __name__ == "__main__":
     unittest.main()
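
Throughout these tests the shared-memory switches are read from plain environment flags; the value is passed through int() before bool(), since bool("0") would otherwise evaluate to True. A self-contained sketch of the same parsing (illustration only, not part of the patch):

    import os

    # "0"/"1" strings must be converted to int first; bool("0") is True.
    TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0)))
    TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0)))

    print(TEST_SYSTEM_SHARED_MEMORY, TEST_CUDA_SHARED_MEMORY)
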
diff --git a/qa/L0_infer_zero/infer_zero_test.py b/qa/L0_infer_zero/infer_zero_test.py
old mode 100644
new mode 100755
index e326529996..9e9b0f4625
--- a/qa/L0_infer_zero/infer_zero_test.py
+++ b/qa/L0_infer_zero/infer_zero_test.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,107 +27,128 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-from builtins import range
-from future.utils import iteritems
+import os
 import unittest
-import numpy as np
+
 import infer_util as iu
+import numpy as np
 import test_util as tu
-import os
 
 np_dtype_string = np.dtype(object)
 
-TEST_SYSTEM_SHARED_MEMORY = bool(
-    int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0)))
-TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get('TEST_CUDA_SHARED_MEMORY',
-                                                  0)))
+TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0)))
+TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0)))
 
 
 class InferZeroTest(tu.TestResultCollector):
-
     def _full_zero(self, dtype, shapes):
         # 'shapes' is list of shapes, one for each input.
 
         # For validation assume any shape can be used...
-        if tu.validate_for_tf_model(dtype, dtype, dtype, shapes[0], shapes[0],
-                                    shapes[0]):
+        if tu.validate_for_tf_model(
+            dtype, dtype, dtype, shapes[0], shapes[0], shapes[0]
+        ):
             # model that supports batching
             for bs in (1, 8):
-                batch_shapes = [[
-                    bs,
-                ] + shape for shape in shapes]
+                batch_shapes = [
+                    [
+                        bs,
+                    ]
+                    + shape
+                    for shape in shapes
+                ]
                 iu.infer_zero(
                     self,
-                    'graphdef',
+                    "graphdef",
                     bs,
                     dtype,
                     batch_shapes,
                     batch_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
                 iu.infer_zero(
                     self,
-                    'savedmodel',
+                    "savedmodel",
                     bs,
                     dtype,
                     batch_shapes,
                     batch_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             # model that does not support batching
-            iu.infer_zero(self,
-                          'graphdef_nobatch',
-                          1,
-                          dtype,
-                          shapes,
-                          shapes,
-                          use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                          use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
-            iu.infer_zero(self,
-                          'savedmodel_nobatch',
-                          1,
-                          dtype,
-                          shapes,
-                          shapes,
-                          use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                          use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
-
-        if tu.validate_for_onnx_model(dtype, dtype, dtype, shapes[0], shapes[0],
-                                      shapes[0]):
+            iu.infer_zero(
+                self,
+                "graphdef_nobatch",
+                1,
+                dtype,
+                shapes,
+                shapes,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
+            iu.infer_zero(
+                self,
+                "savedmodel_nobatch",
+                1,
+                dtype,
+                shapes,
+                shapes,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
+
+        if tu.validate_for_onnx_model(
+            dtype, dtype, dtype, shapes[0], shapes[0], shapes[0]
+        ):
             # model that supports batching
             for bs in (1, 8):
-                batch_shapes = [[
-                    bs,
-                ] + shape for shape in shapes]
+                batch_shapes = [
+                    [
+                        bs,
+                    ]
+                    + shape
+                    for shape in shapes
+                ]
                 iu.infer_zero(
                     self,
-                    'onnx',
+                    "onnx",
                     bs,
                     dtype,
                     batch_shapes,
                     batch_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             # model that does not support batching
-            iu.infer_zero(self,
-                          'onnx_nobatch',
-                          1,
-                          dtype,
-                          shapes,
-                          shapes,
-                          use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                          use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+            iu.infer_zero(
+                self,
+                "onnx_nobatch",
+                1,
+                dtype,
+                shapes,
+                shapes,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
         for name in ["simple_zero", "sequence_zero", "fan_zero"]:
-            if tu.validate_for_ensemble_model(name, dtype, dtype, dtype,
-                                              shapes[0], shapes[0], shapes[0]):
+            if tu.validate_for_ensemble_model(
+                name, dtype, dtype, dtype, shapes[0], shapes[0], shapes[0]
+            ):
                 # model that supports batching
                 for bs in (1, 8):
-                    batch_shapes = [[
-                        bs,
-                    ] + shape for shape in shapes]
+                    batch_shapes = [
+                        [
+                            bs,
+                        ]
+                        + shape
+                        for shape in shapes
+                    ]
                     iu.infer_zero(
                         self,
                         name,
@@ -134,81 +157,135 @@ def _full_zero(self, dtype, shapes):
                         batch_shapes,
                         batch_shapes,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
                 # model that does not support batching
                 iu.infer_zero(
                     self,
-                    name + '_nobatch',
+                    name + "_nobatch",
                     1,
                     dtype,
                     shapes,
                     shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
 
     def test_ff1_sanity(self):
-        self._full_zero(np.float32, ([
-            1,
-        ],))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    1,
+                ],
+            ),
+        )
 
     def test_ff1(self):
-        self._full_zero(np.float32, ([
-            0,
-        ],))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    0,
+                ],
+            ),
+        )
 
     def test_ff3_sanity(self):
-        self._full_zero(np.float32, ([
-            1,
-        ], [
-            2,
-        ], [
-            1,
-        ]))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    1,
+                ],
+                [
+                    2,
+                ],
+                [
+                    1,
+                ],
+            ),
+        )
 
     def test_ff3_0(self):
-        self._full_zero(np.float32, ([
-            0,
-        ], [
-            0,
-        ], [
-            0,
-        ]))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    0,
+                ],
+                [
+                    0,
+                ],
+                [
+                    0,
+                ],
+            ),
+        )
 
     def test_ff3_1(self):
-        self._full_zero(np.float32, ([
-            0,
-        ], [
-            0,
-        ], [
-            1,
-        ]))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    0,
+                ],
+                [
+                    0,
+                ],
+                [
+                    1,
+                ],
+            ),
+        )
 
     def test_ff3_2(self):
-        self._full_zero(np.float32, ([
-            0,
-        ], [
-            1,
-        ], [
-            0,
-        ]))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    0,
+                ],
+                [
+                    1,
+                ],
+                [
+                    0,
+                ],
+            ),
+        )
 
     def test_ff3_3(self):
-        self._full_zero(np.float32, ([
-            1,
-        ], [
-            0,
-        ], [
-            0,
-        ]))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    1,
+                ],
+                [
+                    0,
+                ],
+                [
+                    0,
+                ],
+            ),
+        )
 
     def test_ff3_4(self):
-        self._full_zero(np.float32, ([
-            1,
-        ], [
-            0,
-        ], [
-            1,
-        ]))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    1,
+                ],
+                [
+                    0,
+                ],
+                [
+                    1,
+                ],
+            ),
+        )
 
     def test_hh1_sanity(self):
         self._full_zero(np.float16, ([2, 2],))
@@ -241,14 +318,24 @@ def test_hh3_4(self):
         self._full_zero(np.float16, ([1, 1], [0, 6], [2, 2]))
 
     def test_oo1_sanity(self):
-        self._full_zero(np_dtype_string, ([
-            2,
-        ],))
+        self._full_zero(
+            np_dtype_string,
+            (
+                [
+                    2,
+                ],
+            ),
+        )
 
     def test_oo1(self):
-        self._full_zero(np_dtype_string, ([
-            0,
-        ],))
+        self._full_zero(
+            np_dtype_string,
+            (
+                [
+                    0,
+                ],
+            ),
+        )
 
     def test_oo3_sanity(self):
         self._full_zero(np_dtype_string, ([2, 2], [2, 2], [1, 1]))
@@ -269,15 +356,25 @@ def test_oo3_4(self):
         self._full_zero(np_dtype_string, ([1, 1], [0, 6], [2, 2]))
 
     def test_bb1_sanity(self):
-        self._full_zero(bool, ([
-            10,
-        ],))
+        self._full_zero(
+            bool,
+            (
+                [
+                    10,
+                ],
+            ),
+        )
 
     def test_bb1_0(self):
-        self._full_zero(bool, ([
-            0,
-        ],))
+        self._full_zero(
+            bool,
+            (
+                [
+                    0,
+                ],
+            ),
+        )
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_infer_zero/test.sh b/qa/L0_infer_zero/test.sh
index 7f10f0dd18..02676b2f85 100755
--- a/qa/L0_infer_zero/test.sh
+++ b/qa/L0_infer_zero/test.sh
@@ -55,6 +55,10 @@ rm -fr models && mkdir models
 cp -r /data/inferenceserver/${REPO_VERSION}/qa_identity_model_repository/* models/. && \
     cp -r /data/inferenceserver/${REPO_VERSION}/qa_ensemble_model_repository/qa_identity_model_repository/* models/.
 
+# Remove version-compatible TensorRT models, as they require version-compatibility
+# mode to be turned on when starting the server.
+rm -rf models/plan_compatible*
+
 create_nop_version_dir `pwd`/models
 
 RET=0
diff --git a/qa/L0_inferentia_perf_analyzer/test.sh b/qa/L0_inferentia_perf_analyzer/test.sh
old mode 100644
new mode 100755
index 21e361ee6c..1881e07f87
--- a/qa/L0_inferentia_perf_analyzer/test.sh
+++ b/qa/L0_inferentia_perf_analyzer/test.sh
@@ -25,21 +25,21 @@
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
-# First need to set up enviroment
+# First need to set up environment
 if [ ${USE_TENSORFLOW} == "1" ] && [ ${USE_PYTORCH} == "1" ] ; then
     echo " Unsupported test configuration. Only one of USE_TENSORFLOW and USE_PYTORCH can be set to 1."
     exit 0
 elif [ ${USE_TENSORFLOW} == "1" ] ; then
-    echo "Setting up enviroment with tensorflow 1"
+    echo "Setting up environment with tensorflow 1"
     source ${TRITON_PATH}/python_backend/inferentia/scripts/setup.sh -t --tensorflow-version 1
 elif [ ${USE_PYTORCH} == "1" ] ; then
-    echo "Setting up enviroment with pytorch"
+    echo "Setting up environment with pytorch"
     source ${TRITON_PATH}/python_backend/inferentia/scripts/setup.sh -p
-else 
+else
     echo " Unsupported test configuration. USE_TENSORFLOW flag is: ${USE_TENSORFLOW} and USE_PYTORCH flag is: ${USE_PYTORCH}. Only one of them can be set to 1."
     exit 0
 fi
-echo "done setting up enviroment"
+echo "done setting up environment"
 
 REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
 if [ "$#" -ge 1 ]; then
@@ -80,32 +80,32 @@ function create_inferentia_models () {
     for DISABLE_DEFAULT_BATCHING_FLAG in ${DISABLE_DEFAULT_BATCHING_FLAGS}; do
         for BATCHED_FLAG in ${BATCHED_FLAGS}; do
             for TEST_TYPE in ${TEST_TYPES}; do
-                CURR_GEN_SCRIPT="${GEN_SCRIPT} --model_type ${MODEL_TYPE}  
-                --triton_model_dir ${TRITON_PATH}/models_${TEST_TYPE}${BATCHED_FLAG}${TEST_FRAMEWORK}${DISABLE_DEFAULT_BATCHING_FLAG}/add-sub-1x4 
+                CURR_GEN_SCRIPT="${GEN_SCRIPT} --model_type ${MODEL_TYPE}
+                --triton_model_dir ${TRITON_PATH}/models_${TEST_TYPE}${BATCHED_FLAG}${TEST_FRAMEWORK}${DISABLE_DEFAULT_BATCHING_FLAG}/add-sub-1x4
                 --compiled_model ${COMPILED_MODEL}"
                 if [ ${DISABLE_DEFAULT_BATCHING_FLAG} == "_no_batch" ]; then
-                    CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT} 
+                    CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT}
                     --disable_batch_requests_to_neuron"
                 fi
                 if [ ${BATCHED_FLAG} == "_batched_" ]; then
                     CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT}
-                    --triton_input INPUT__0,INT64,4 INPUT__1,INT64,4 
-                    --triton_output OUTPUT__0,INT64,4 OUTPUT__1,INT64,4          
-                    --enable_dynamic_batching 
-                    --max_batch_size 1000 
-                    --preferred_batch_size 8 
+                    --triton_input INPUT__0,INT64,4 INPUT__1,INT64,4
+                    --triton_output OUTPUT__0,INT64,4 OUTPUT__1,INT64,4
+                    --enable_dynamic_batching
+                    --max_batch_size 1000
+                    --preferred_batch_size 8
                     --max_queue_delay_microseconds 100"
                 else
                     CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT}
-                    --triton_input INPUT__0,INT64,-1x4 INPUT__1,INT64,-1x4 
+                    --triton_input INPUT__0,INT64,-1x4 INPUT__1,INT64,-1x4
                     --triton_output OUTPUT__0,INT64,-1x4 OUTPUT__1,INT64,-1x4"
                 fi
                 if [ ${TEST_TYPE} == "single" ]; then
-                    CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT}   
+                    CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT}
                     --neuron_core_range 0:0"
                 elif [ ${TEST_TYPE} == "multiple" ]; then
-                    CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT} 
-                    --triton_model_instance_count 3 
+                    CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT}
+                    --triton_model_instance_count 3
                     --neuron_core_range 0:7"
                 fi
                 echo ${CURR_GEN_SCRIPT}
diff --git a/qa/L0_io/test.sh b/qa/L0_io/test.sh
index ac1ad5559e..84ab4fb0c0 100755
--- a/qa/L0_io/test.sh
+++ b/qa/L0_io/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -47,16 +47,14 @@ MODELSDIR=`pwd`/models
 DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository
 ENSEMBLEDIR=/data/inferenceserver/${REPO_VERSION}/qa_ensemble_model_repository/qa_model_repository
 
-export CUDA_VISIBLE_DEVICES=0,1
-
 # Must explicitly set LD_LIBRARY_PATH so that IO_TEST_UTIL can find
 # libtritonserver.so.
 LD_LIBRARY_PATH=/opt/tritonserver/lib:$LD_LIBRARY_PATH
 
-rm -f $CLIENT_LOG.*
+rm -f $CLIENT_LOG*
 
 # PyTorch is required for the Python backend dlpack add sub models
-pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
+pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
 RET=0
 
 # Prepare float32 models with basic config
@@ -70,8 +68,7 @@ for trial in graphdef savedmodel onnx libtorch plan python python_dlpack; do
             cp ../python_models/add_sub/config.pbtxt $MODELSDIR/${full}/. && \
             (cd $MODELSDIR/${full} && \
                     sed -i "s/label_filename:.*//" config.pbtxt && \
-                    sed -i "0,/name:.*/{s/name:.*/name: \"${full}\"/}" config.pbtxt && \
-                                        echo "max_batch_size: 64" >> config.pbtxt)
+                    echo "max_batch_size: 64" >> config.pbtxt)
 
         # ensemble version of the model.
         mkdir -p $MODELSDIR/fan_${full}/1 && \
@@ -148,23 +145,47 @@ cp -r $MODELSDIR/fan_graphdef_float32_float32_float32 $MODELSDIR/fan_${full} &&
 cp -r $ENSEMBLEDIR/nop_TYPE_FP32_-1 $MODELSDIR/. && \
     mkdir -p $MODELSDIR/nop_TYPE_FP32_-1/1
 
+# prepare libtorch multi-device and multi-gpu models
+cp -r ../L0_libtorch_instance_group_kind_model/models/libtorch_multi_device $MODELSDIR/.
+cp ../L0_libtorch_instance_group_kind_model/gen_models.py ./gen_libtorch_model.py
+mkdir -p $MODELSDIR/libtorch_multi_device/1
+mkdir -p $MODELSDIR/libtorch_multi_gpu/1
+cp $MODELSDIR/libtorch_multi_device/config.pbtxt $MODELSDIR/libtorch_multi_gpu/.
+(cd $MODELSDIR/libtorch_multi_gpu && \
+    sed -i "s/name: \"libtorch_multi_device\"/name: \"libtorch_multi_gpu\"/" config.pbtxt)
+
+set +e
+python3 gen_libtorch_model.py >> $CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Error when generating libtorch models. \n***"
+    cat $CLIENT_LOG
+    exit 1
+fi
+set -e
+
+TRIALS="graphdef savedmodel onnx libtorch plan python python_dlpack libtorch_multi_gpu libtorch_multi_device"
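+# The libtorch_multi_* models pin device placement in their own configs, so
+# the loop below skips the per-device config rewrite and the ensemble run
+# for them.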
 for input_device in -1 0 1; do
     for output_device in -1 0 1; do
-        for trial in graphdef savedmodel onnx libtorch plan python python_dlpack; do
+        for trial in ${TRIALS}; do
             # TensorRT Plan should only be deployed on GPU device
             model_devices="-1 0 1" && [[ "$trial" == "plan" ]] && model_devices="0 1"
+            full=${trial}_float32_float32_float32 && [[ "$trial" == "libtorch_multi"* ]] && full=${trial}
+
             for model_device in $model_devices; do
-                full=${trial}_float32_float32_float32
                 full_log=$CLIENT_LOG.$full.$input_device.$output_device.$model_device
 
                 host_policy=cpu
                 if [ "$model_device" == "-1" ]; then
-                    (cd $MODELSDIR/${full} && \
-                        sed -i "s/instance_group.*/instance_group [{ kind: KIND_CPU }]/" config.pbtxt)
+                    if [[ "$trial" != "libtorch_multi"* ]]; then
+                        (cd $MODELSDIR/${full} && \
+                            sed -i "s/instance_group.*/instance_group [{ kind: KIND_CPU }]/" config.pbtxt)
+                    fi
                 else
                     host_policy=gpu_${model_device}
-                    (cd $MODELSDIR/${full} && \
-                        sed -i "s/instance_group.*/instance_group [{ kind: KIND_GPU, gpus: [${model_device}] }]/" config.pbtxt)
+                    if [[ "$trial" != "libtorch_multi"* ]]; then
+                        (cd $MODELSDIR/${full} && \
+                            sed -i "s/instance_group.*/instance_group [{ kind: KIND_GPU, gpus: [${model_device}] }]/" config.pbtxt)
+                    fi
                 fi
 
                 set +e
@@ -196,14 +217,16 @@ for input_device in -1 0 1; do
                 set -e
 
                 # ensemble
-                set +e
-                $IO_TEST_UTIL -i $input_device -o $output_device -r $MODELSDIR -m fan_$full >>$full_log.ensemble 2>&1
-                if [ $? -ne 0 ]; then
-                    cat $full_log.ensemble
-                    echo -e "\n***\n*** Test Failed\n***"
-                    RET=1
+                if [[ "$trial" != "libtorch_multi"* ]]; then
+                    set +e
+                    $IO_TEST_UTIL -i $input_device -o $output_device -r $MODELSDIR -m fan_$full >>$full_log.ensemble 2>&1
+                    if [ $? -ne 0 ]; then
+                        cat $full_log.ensemble
+                        echo -e "\n***\n*** Test Failed\n***"
+                        RET=1
+                    fi
+                    set -e
                 fi
-                set -e
             done
         done
 
diff --git a/qa/L0_iterative_sequence/iterative_sequence_e2e.py b/qa/L0_iterative_sequence/iterative_sequence_e2e.py
new file mode 100755
index 0000000000..378b6ebe82
--- /dev/null
+++ b/qa/L0_iterative_sequence/iterative_sequence_e2e.py
@@ -0,0 +1,192 @@
+#!/usr/bin/env python
+# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import json
+
+# gRPC streaming helpers.
+import queue
+import unittest
+from functools import partial
+
+import numpy as np
+import requests
+import sseclient
+import test_util as tu
+import tritonclient.grpc as grpcclient
+from tritonclient.utils import InferenceServerException
+
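+# Doubled braces in MODEL_CONFIG_BASE are literal JSON braces under
+# str.format(); the single "{}" placeholder is filled with a scheduler
+# config (e.g. sequence_batching or dynamic_batching) by the tests below.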
+MODEL_CONFIG_BASE = """
+{{
+"backend": "iterative_sequence",
+"max_batch_size": 4,
+"input" : [
+  {{
+    "name": "INPUT",
+    "data_type": "TYPE_INT32",
+    "dims": [ 1 ]
+  }}
+],
+"output" : [
+  {{
+    "name": "OUTPUT",
+    "data_type": "TYPE_INT32",
+    "dims": [ 1 ]
+  }}
+],
+"model_transaction_policy" : {{
+  "decoupled": true
+}},
+{},
+"instance_group" : [{{ "kind": "KIND_CPU" }}]
+}}
+"""
+
+
+class UserData:
+    def __init__(self):
+        self._completed_requests = queue.Queue()
+
+
+def callback(user_data, result, error):
+    if error:
+        user_data._completed_requests.put(error)
+    else:
+        user_data._completed_requests.put(result)
+
+
+class IterativeSequenceTest(tu.TestResultCollector):
+    def setUp(self):
+        # Always make sure the original config is used
+        with grpcclient.InferenceServerClient("localhost:8001") as triton_client:
+            triton_client.load_model("iterative_sequence")
+
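+    # The /generate_stream endpoint streams one SSE event per decoupled
+    # response; with INPUT=2 the model returns two responses counting down
+    # to zero.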
+    def test_generate_stream(self):
+        headers = {"Accept": "text/event-stream"}
+        url = "http://localhost:8000/v2/models/iterative_sequence/generate_stream"
+        inputs = {"INPUT": 2}
+        res = requests.post(url, data=json.dumps(inputs), headers=headers)
+        res.raise_for_status()
+        client = sseclient.SSEClient(res)
+        res_count = 2
+        for event in client.events():
+            res_count -= 1
+            data = json.loads(event.data)
+            self.assertIn("OUTPUT", data)
+            self.assertEqual(res_count, data["OUTPUT"])
+        self.assertEqual(0, res_count)
+
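+    # Reused by the scheduler-override tests below; by default no sequence ID
+    # or START flag is sent, relying on iterative sequence mode to manage the
+    # sequence on the server side.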
+    def test_grpc_stream(self, sequence_id=0, sequence_start=False):
+        user_data = UserData()
+        with grpcclient.InferenceServerClient("localhost:8001") as triton_client:
+            triton_client.start_stream(callback=partial(callback, user_data))
+            inputs = []
+            inputs.append(grpcclient.InferInput("INPUT", [1, 1], "INT32"))
+            inputs[0].set_data_from_numpy(np.array([[2]], dtype=np.int32))
+
+            triton_client.async_stream_infer(
+                model_name="iterative_sequence",
+                inputs=inputs,
+                sequence_id=sequence_id,
+                sequence_start=sequence_start,
+            )
+            res_count = 2
+            while res_count > 0:
+                data_item = user_data._completed_requests.get()
+                res_count -= 1
+                if type(data_item) == InferenceServerException:
+                    raise data_item
+                else:
+                    self.assertEqual(res_count, data_item.as_numpy("OUTPUT")[0][0])
+            self.assertEqual(0, res_count)
+
+    def test_reschedule_error(self):
+        # Use short idle timeout (< backend reschedule delay: 0.5s) so that
+        # the backend won't be able to reschedule the request as the scheduler
+        # will terminate the sequence early
+        config = r'"sequence_batching" : { "iterative_sequence" : true, "max_sequence_idle_microseconds" : 200000 }'
+        with grpcclient.InferenceServerClient("localhost:8001") as triton_client:
+            triton_client.load_model(
+                "iterative_sequence", config=MODEL_CONFIG_BASE.format(config)
+            )
+        with self.assertRaises(InferenceServerException) as context:
+            # The short idle timeout terminates the sequence before the
+            # backend can reschedule, so the rescheduled request is rejected
+            # for missing the START flag
+            self.test_grpc_stream()
+        print(str(context.exception))
+        self.assertTrue(
+            "must specify the START flag on the first request of the sequence"
+            in str(context.exception)
+        )
+
+    def test_unsupported_sequence_scheduler(self):
+        # Override model config with scheduler settings that do not support
+        # request rescheduling.
+        configs = [
+            r'"sequence_batching" : { "direct" : {}, "iterative_sequence" : false }',
+            r'"sequence_batching" : { "oldest" : {}, "iterative_sequence" : false }',
+        ]
+        sid = 1
+        for sc in configs:
+            with grpcclient.InferenceServerClient("localhost:8001") as triton_client:
+                triton_client.load_model(
+                    "iterative_sequence", config=MODEL_CONFIG_BASE.format(sc)
+                )
+            with self.assertRaises(InferenceServerException) as context:
+                # Without specifying 'iterative_sequence : true', the sequence
+                # batcher expects sequence parameters to be provided explicitly
+                self.test_grpc_stream(sequence_id=sid, sequence_start=True)
+            sid += 1
+            self.assertTrue(
+                "Request is released with TRITONSERVER_REQUEST_RELEASE_RESCHEDULE"
+                in str(context.exception)
+            )
+
+    def test_unsupported_dynamic_scheduler(self):
+        # Override model config with scheduler settings that do not support
+        # request rescheduling.
+        configs = [
+            r'"dynamic_batching" : {}',
+        ]
+        for sc in configs:
+            with grpcclient.InferenceServerClient("localhost:8001") as triton_client:
+                triton_client.load_model(
+                    "iterative_sequence", config=MODEL_CONFIG_BASE.format(sc)
+                )
+            with self.assertRaises(InferenceServerException) as context:
+                self.test_grpc_stream()
+            self.assertTrue(
+                "Request is released with TRITONSERVER_REQUEST_RELEASE_RESCHEDULE"
+                in str(context.exception)
+            )
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_iterative_sequence/models/iterative_sequence/config.pbtxt b/qa/L0_iterative_sequence/models/iterative_sequence/config.pbtxt
new file mode 100644
index 0000000000..d6e539007b
--- /dev/null
+++ b/qa/L0_iterative_sequence/models/iterative_sequence/config.pbtxt
@@ -0,0 +1,48 @@
+# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+backend: "iterative_sequence"
+max_batch_size: 4
+input [
+  {
+    name: "INPUT"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  }
+]
+output [
+  {
+    name: "OUTPUT"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  }
+]
+model_transaction_policy {
+  decoupled: True
+}
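+# With iterative_sequence enabled the backend may reschedule a request to
+# emit multiple decoupled responses for a single sequence.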
+sequence_batching {
+  iterative_sequence : true
+}
+instance_group [{ kind: KIND_CPU }]
diff --git a/qa/L0_iterative_sequence/test.sh b/qa/L0_iterative_sequence/test.sh
new file mode 100755
index 0000000000..09117ffe93
--- /dev/null
+++ b/qa/L0_iterative_sequence/test.sh
@@ -0,0 +1,92 @@
+#!/bin/bash
+# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
+source ../common/util.sh
+
+RET=0
+
+CLIENT_LOG="./iterative_sequence_client.log"
+TEST_PY=./iterative_sequence_e2e.py
+EXPECTED_NUM_TESTS="5"
+TEST_RESULT_FILE='test_results.txt'
+
+
+export CUDA_VISIBLE_DEVICES=0
+
+rm -fr *.log
+
+pip install sseclient-py
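+# sseclient-py is used by iterative_sequence_e2e.py to consume the SSE
+# responses from the generate_stream endpoint.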
+
+SERVER=/opt/tritonserver/bin/tritonserver
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=EXPLICIT"
+SERVER_LOG="./inference_server.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $TEST_PY >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+else
+    check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Result Verification Failed\n***"
+        RET=1
+    fi
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+if [ $RET -eq 0 ]; then
+    echo -e "\n***\n*** Test Passed\n***"
+else
+    cat $CLIENT_LOG
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Test FAILED\n***"
+fi
+
+exit $RET
diff --git a/qa/L0_java_memory_growth/MemoryGrowthTest.java b/qa/L0_java_memory_growth/MemoryGrowthTest.java
index d5a8092872..28243459ec 100644
--- a/qa/L0_java_memory_growth/MemoryGrowthTest.java
+++ b/qa/L0_java_memory_growth/MemoryGrowthTest.java
@@ -1,4 +1,4 @@
-// Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 //
 // Redistribution and use in source and binary forms, with or without
 // modification, are permitted provided that the following conditions
@@ -24,880 +24,920 @@
 // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
+import static org.bytedeco.tritonserver.global.tritonserver.*;
+
+import com.google.gson.*;
 import java.io.*;
 import java.util.*;
 import java.util.concurrent.*;
-import com.google.gson.*;
 import org.bytedeco.javacpp.*;
 import org.bytedeco.tritonserver.tritonserver.*;
-import static org.bytedeco.tritonserver.global.tritonserver.*;
 
 public class MemoryGrowthTest {
-    static final double TRITON_MIN_COMPUTE_CAPABILITY = 6.0;
-    private static boolean done = false;
-    static float max_growth_allowed = .10f;
-    static int max_mem_allowed = 30;
-
-    static void FAIL(String MSG) {
-        System.err.println("failure: " + MSG);
-        System.exit(1);
-    }
-
-    static void FAIL_IF_ERR(TRITONSERVER_Error err__, String MSG) {
-        if (err__ != null) {
-            System.err.println("error: " + MSG + ":"
-                             + TRITONSERVER_ErrorCodeString(err__) + " - "
-                             + TRITONSERVER_ErrorMessage(err__));
-            TRITONSERVER_ErrorDelete(err__);
-            System.exit(1);
-        }
+  static final double TRITON_MIN_COMPUTE_CAPABILITY = 6.0;
+  private static boolean done = false;
+  static float max_growth_allowed = .10f;
+  static int max_mem_allowed = 30;
+
+  static void FAIL(String MSG)
+  {
+    System.err.println("failure: " + MSG);
+    System.exit(1);
+  }
+
+  static void FAIL_IF_ERR(TRITONSERVER_Error err__, String MSG)
+  {
+    if (err__ != null) {
+      System.err.println(
+          "error: " + MSG + ":" + TRITONSERVER_ErrorCodeString(err__) + " - "
+          + TRITONSERVER_ErrorMessage(err__));
+      TRITONSERVER_ErrorDelete(err__);
+      System.exit(1);
     }
+  }
 
-    static boolean enforce_memory_type = false;
-    static int requested_memory_type;
-    // Parameters for percentile range to include (exclude outliers)
-    static final int max_percentile = 90;
-    static final int min_percentile = 10;
+  static boolean enforce_memory_type = false;
+  static int requested_memory_type;
+  // Parameters for percentile range to include (exclude outliers)
+  static final int max_percentile = 90;
+  static final int min_percentile = 10;
 
-    static class TRITONSERVER_ServerDeleter extends TRITONSERVER_Server {
-        public TRITONSERVER_ServerDeleter(TRITONSERVER_Server p) { super(p); deallocator(new DeleteDeallocator(this)); }
-        protected static class DeleteDeallocator extends TRITONSERVER_Server implements Deallocator {
-            DeleteDeallocator(Pointer p) { super(p); }
-            @Override public void deallocate() { TRITONSERVER_ServerDelete(this); }
-        }
-    }
-
-    static void
-    Usage(String msg)
+  static class TRITONSERVER_ServerDeleter extends TRITONSERVER_Server {
+    public TRITONSERVER_ServerDeleter(TRITONSERVER_Server p)
     {
-      if (msg != null) {
-        System.err.println(msg);
-      }
-
-      System.err.println("Usage: java " + MemoryGrowthTest.class.getSimpleName() + " [options]");
-      System.err.println("\t-i Set number of iterations");
-      System.err.println("\t-m <\"system\"|\"pinned\"|gpu>"
-                       + " Enforce the memory type for input and output tensors."
-                       + " If not specified, inputs will be in system memory and outputs"
-                       + " will be based on the model's preferred type.");
-      System.err.println("\t-v Enable verbose logging");
-      System.err.println("\t-r [model repository absolute path]");
-      System.err.println("\t--max-growth Specify maximum allowed memory growth (%)");
-      System.err.println("\t--max-memory Specify maximum allowed memory (MB)");
-
-      System.exit(1);
+      super(p);
+      deallocator(new DeleteDeallocator(this));
     }
-
-    static class ResponseAlloc extends TRITONSERVER_ResponseAllocatorAllocFn_t {
-        @Override public TRITONSERVER_Error call (
-            TRITONSERVER_ResponseAllocator allocator, String tensor_name,
-            long byte_size, int preferred_memory_type,
-            long preferred_memory_type_id, Pointer userp, PointerPointer buffer,
-            PointerPointer buffer_userp, IntPointer actual_memory_type,
-            LongPointer actual_memory_type_id)
-        {
-          // Initially attempt to make the actual memory type and id that we
-          // allocate be the same as preferred memory type
-          actual_memory_type.put(0, preferred_memory_type);
-          actual_memory_type_id.put(0, preferred_memory_type_id);
-
-          // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
-          // need to do any other book-keeping.
-          if (byte_size == 0) {
-            buffer.put(0, null);
-            buffer_userp.put(0, null);
-          } else {
-            Pointer allocated_ptr = new Pointer();
-            if (enforce_memory_type) {
-              actual_memory_type.put(0, requested_memory_type);
-            }
-
-            actual_memory_type.put(0, TRITONSERVER_MEMORY_CPU);
-            allocated_ptr = Pointer.malloc(byte_size);
-
-            // Pass the tensor name with buffer_userp so we can show it when
-            // releasing the buffer.
-            if (!allocated_ptr.isNull()) {
-              buffer.put(0, allocated_ptr);
-              buffer_userp.put(0, Loader.newGlobalRef(tensor_name));
-            }
-          }
-
-          return null;  // Success
-        }
+    protected static class DeleteDeallocator
+        extends TRITONSERVER_Server implements Deallocator {
+      DeleteDeallocator(Pointer p) { super(p); }
+      @Override public void deallocate() { TRITONSERVER_ServerDelete(this); }
     }
+  }
 
-    static class ResponseRelease extends TRITONSERVER_ResponseAllocatorReleaseFn_t {
-        @Override public TRITONSERVER_Error call (
-            TRITONSERVER_ResponseAllocator allocator, Pointer buffer, Pointer buffer_userp,
-            long byte_size, int memory_type, long memory_type_id)
-        {
-          String name = null;
-          if (buffer_userp != null) {
-            name = (String)Loader.accessGlobalRef(buffer_userp);
-          } else {
-            name = "";
-          }
-          Pointer.free(buffer);
-          Loader.deleteGlobalRef(buffer_userp);
-
-          return null;  // Success
-        }
+  static void Usage(String msg)
+  {
+    if (msg != null) {
+      System.err.println(msg);
     }
 
-    static class InferRequestComplete extends TRITONSERVER_InferenceRequestReleaseFn_t {
-        @Override public void call (
-            TRITONSERVER_InferenceRequest request, int flags, Pointer userp)
-        {
-          // We reuse the request so we don't delete it here.
-        }
-    }
-
-    static class InferResponseComplete extends TRITONSERVER_InferenceResponseCompleteFn_t {
-        @Override public void call (
-            TRITONSERVER_InferenceResponse response, int flags, Pointer userp)
-        {
-          if (response != null) {
-            // Send 'response' to the future.
-            futures.get(userp).complete(response);
-          }
+    System.err.println(
+        "Usage: java " + MemoryGrowthTest.class.getSimpleName() + " [options]");
+    System.err.println("\t-i Set number of iterations");
+    System.err.println(
+        "\t-m <\"system\"|\"pinned\"|gpu>"
+        + " Enforce the memory type for input and output tensors."
+        + " If not specified, inputs will be in system memory and outputs"
+        + " will be based on the model's preferred type.");
+    System.err.println("\t-v Enable verbose logging");
+    System.err.println("\t-r [model repository absolute path]");
+    System.err.println(
+        "\t--max-growth Specify maximum allowed memory growth (%)");
+    System.err.println("\t--max-memory Specify maximum allowed memory (MB)");
+
+    System.exit(1);
+  }
+
+  static class ResponseAlloc extends TRITONSERVER_ResponseAllocatorAllocFn_t {
+    @Override
+    public TRITONSERVER_Error call(
+        TRITONSERVER_ResponseAllocator allocator, String tensor_name,
+        long byte_size, int preferred_memory_type,
+        long preferred_memory_type_id, Pointer userp, PointerPointer buffer,
+        PointerPointer buffer_userp, IntPointer actual_memory_type,
+        LongPointer actual_memory_type_id)
+    {
+      // Initially attempt to make the actual memory type and id that we
+      // allocate be the same as preferred memory type
+      actual_memory_type.put(0, preferred_memory_type);
+      actual_memory_type_id.put(0, preferred_memory_type_id);
+
+      // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
+      // need to do any other book-keeping.
+      if (byte_size == 0) {
+        buffer.put(0, null);
+        buffer_userp.put(0, null);
+      } else {
+        Pointer allocated_ptr = new Pointer();
+        if (enforce_memory_type) {
+          actual_memory_type.put(0, requested_memory_type);
         }
-    }
 
-    static ConcurrentHashMap<Pointer, CompletableFuture<TRITONSERVER_InferenceResponse>> futures = new ConcurrentHashMap<>();
-    static ResponseAlloc responseAlloc = new ResponseAlloc();
-    static ResponseRelease responseRelease = new ResponseRelease();
-    static InferRequestComplete inferRequestComplete = new InferRequestComplete();
-    static InferResponseComplete inferResponseComplete = new InferResponseComplete();
+        actual_memory_type.put(0, TRITONSERVER_MEMORY_CPU);
+        allocated_ptr = Pointer.malloc(byte_size);
 
-    static TRITONSERVER_Error
-    ParseModelMetadata(
-        JsonObject model_metadata, boolean[] is_int,
-        boolean[] is_torch_model)
-    {
-      String seen_data_type = null;
-      for (JsonElement input_element : model_metadata.get("inputs").getAsJsonArray()) {
-        JsonObject input = input_element.getAsJsonObject();
-        if (!input.get("datatype").getAsString().equals("INT32") &&
-            !input.get("datatype").getAsString().equals("FP32")) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_UNSUPPORTED,
-              "simple lib example only supports model with data type INT32 or " +
-              "FP32");
-        }
-        if (seen_data_type == null) {
-          seen_data_type = input.get("datatype").getAsString();
-        } else if (!seen_data_type.equals(input.get("datatype").getAsString())) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_INVALID_ARG,
-              "the inputs and outputs of 'simple' model must have the data type");
-        }
-      }
-      for (JsonElement output_element : model_metadata.get("outputs").getAsJsonArray()) {
-        JsonObject output = output_element.getAsJsonObject();
-        if (!output.get("datatype").getAsString().equals("INT32") &&
-            !output.get("datatype").getAsString().equals("FP32")) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_UNSUPPORTED,
-              "simple lib example only supports model with data type INT32 or " +
-              "FP32");
-        } else if (!seen_data_type.equals(output.get("datatype").getAsString())) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_INVALID_ARG,
-              "the inputs and outputs of 'simple' model must have the data type");
+        // Pass the tensor name with buffer_userp so we can show it when
+        // releasing the buffer.
+        if (!allocated_ptr.isNull()) {
+          buffer.put(0, allocated_ptr);
+          buffer_userp.put(0, Loader.newGlobalRef(tensor_name));
         }
       }
 
-      is_int[0] = seen_data_type.equals("INT32");
-      is_torch_model[0] =
-          model_metadata.get("platform").getAsString().equals("pytorch_libtorch");
-      return null;
+      return null; // Success
     }
-
-    static void
-    GenerateInputData(
-        IntPointer[] input0_data, IntPointer[] input1_data)
+  }
+
+  static class ResponseRelease
+      extends TRITONSERVER_ResponseAllocatorReleaseFn_t {
+    @Override
+    public TRITONSERVER_Error call(
+        TRITONSERVER_ResponseAllocator allocator, Pointer buffer,
+        Pointer buffer_userp, long byte_size, int memory_type,
+        long memory_type_id)
     {
-      input0_data[0] = new IntPointer(16);
-      input1_data[0] = new IntPointer(16);
-      for (int i = 0; i < 16; ++i) {
-        input0_data[0].put(i, i);
-        input1_data[0].put(i, 1);
+      String name = null;
+      if (buffer_userp != null) {
+        name = (String) Loader.accessGlobalRef(buffer_userp);
+      } else {
+        name = "";
       }
+      Pointer.free(buffer);
+      Loader.deleteGlobalRef(buffer_userp);
+
+      return null; // Success
     }
+  }
 
-    static void
-    GenerateInputData(
-        FloatPointer[] input0_data, FloatPointer[] input1_data)
+  static class InferRequestComplete
+      extends TRITONSERVER_InferenceRequestReleaseFn_t {
+    @Override
+    public void call(
+        TRITONSERVER_InferenceRequest request, int flags, Pointer userp)
     {
-      input0_data[0] = new FloatPointer(16);
-      input1_data[0] = new FloatPointer(16);
-      for (int i = 0; i < 16; ++i) {
-        input0_data[0].put(i, i);
-        input1_data[0].put(i, 1);
-      }
+      // We reuse the request so we don't delete it here.
     }
+  }
 
-    static void
-    CompareResult(
-        String output0_name, String output1_name,
-        IntPointer input0, IntPointer input1, IntPointer output0,
-        IntPointer output1)
+  static class InferResponseComplete
+      extends TRITONSERVER_InferenceResponseCompleteFn_t {
+    @Override
+    public void call(
+        TRITONSERVER_InferenceResponse response, int flags, Pointer userp)
     {
-      for (int i = 0; i < 16; ++i) {
-        if ((input0.get(i) + input1.get(i)) != output0.get(i)) {
-          FAIL("incorrect sum in " + output0_name);
-        }
-        if ((input0.get(i) - input1.get(i)) != output1.get(i)) {
-          FAIL("incorrect difference in " + output1_name);
-        }
+      if (response != null) {
+        // Send 'response' to the future.
+        futures.get(userp).complete(response);
+      }
+    }
+  }
+
+  static ConcurrentHashMap<
+      Pointer, CompletableFuture<TRITONSERVER_InferenceResponse>> futures =
+      new ConcurrentHashMap<>();
+  static ResponseAlloc responseAlloc = new ResponseAlloc();
+  static ResponseRelease responseRelease = new ResponseRelease();
+  static InferRequestComplete inferRequestComplete = new InferRequestComplete();
+  static InferResponseComplete inferResponseComplete =
+      new InferResponseComplete();
+
+  static TRITONSERVER_Error ParseModelMetadata(
+      JsonObject model_metadata, boolean[] is_int, boolean[] is_torch_model)
+  {
+    String seen_data_type = null;
+    for (JsonElement input_element :
+         model_metadata.get("inputs").getAsJsonArray()) {
+      JsonObject input = input_element.getAsJsonObject();
+      if (!input.get("datatype").getAsString().equals("INT32")
+          && !input.get("datatype").getAsString().equals("FP32")) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_UNSUPPORTED,
+            "simple lib example only supports model with data type INT32 or "
+                + "FP32");
+      }
+      if (seen_data_type == null) {
+        seen_data_type = input.get("datatype").getAsString();
+      } else if (!seen_data_type.equals(input.get("datatype").getAsString())) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_INVALID_ARG,
+            "the inputs and outputs of 'simple' model must have the data type");
+      }
+    }
+    for (JsonElement output_element :
+         model_metadata.get("outputs").getAsJsonArray()) {
+      JsonObject output = output_element.getAsJsonObject();
+      if (!output.get("datatype").getAsString().equals("INT32")
+          && !output.get("datatype").getAsString().equals("FP32")) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_UNSUPPORTED,
+            "simple lib example only supports model with data type INT32 or "
+                + "FP32");
+      } else if (!seen_data_type.equals(output.get("datatype").getAsString())) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_INVALID_ARG,
+            "the inputs and outputs of 'simple' model must have the data type");
       }
     }
 
-    static void
-    CompareResult(
-        String output0_name, String output1_name,
-        FloatPointer input0, FloatPointer input1, FloatPointer output0,
-        FloatPointer output1)
-    {
-      for (int i = 0; i < 16; ++i) {
-        if ((input0.get(i) + input1.get(i)) != output0.get(i)) {
-          FAIL("incorrect sum in " + output0_name);
-        }
-        if ((input0.get(i) - input1.get(i)) != output1.get(i)) {
-          FAIL("incorrect difference in " + output1_name);
-        }
+    is_int[0] = seen_data_type.equals("INT32");
+    is_torch_model[0] =
+        model_metadata.get("platform").getAsString().equals("pytorch_libtorch");
+    return null;
+  }
+
+  static void GenerateInputData(
+      IntPointer[] input0_data, IntPointer[] input1_data)
+  {
+    input0_data[0] = new IntPointer(16);
+    input1_data[0] = new IntPointer(16);
+    for (int i = 0; i < 16; ++i) {
+      input0_data[0].put(i, i);
+      input1_data[0].put(i, 1);
+    }
+  }
+
+  static void GenerateInputData(
+      FloatPointer[] input0_data, FloatPointer[] input1_data)
+  {
+    input0_data[0] = new FloatPointer(16);
+    input1_data[0] = new FloatPointer(16);
+    for (int i = 0; i < 16; ++i) {
+      input0_data[0].put(i, i);
+      input1_data[0].put(i, 1);
+    }
+  }
+
+  static void CompareResult(
+      String output0_name, String output1_name, IntPointer input0,
+      IntPointer input1, IntPointer output0, IntPointer output1)
+  {
+    for (int i = 0; i < 16; ++i) {
+      if ((input0.get(i) + input1.get(i)) != output0.get(i)) {
+        FAIL("incorrect sum in " + output0_name);
+      }
+      if ((input0.get(i) - input1.get(i)) != output1.get(i)) {
+        FAIL("incorrect difference in " + output1_name);
+      }
+    }
+  }
+
+  static void CompareResult(
+      String output0_name, String output1_name, FloatPointer input0,
+      FloatPointer input1, FloatPointer output0, FloatPointer output1)
+  {
+    for (int i = 0; i < 16; ++i) {
+      if ((input0.get(i) + input1.get(i)) != output0.get(i)) {
+        FAIL("incorrect sum in " + output0_name);
+      }
+      if ((input0.get(i) - input1.get(i)) != output1.get(i)) {
+        FAIL("incorrect difference in " + output1_name);
       }
     }
+  }
+
+  static void Check(
+      TRITONSERVER_InferenceResponse response, Pointer input0_data,
+      Pointer input1_data, String output0, String output1,
+      long expected_byte_size, int expected_datatype, boolean is_int)
+  {
+    HashMap<String, BytePointer> output_data = new HashMap<>();
+
+    int[] output_count = {0};
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceResponseOutputCount(response, output_count),
+        "getting number of response outputs");
+    if (output_count[0] != 2) {
+      FAIL("expecting 2 response outputs, got " + output_count[0]);
+    }
 
-    static void
-    Check(
-        TRITONSERVER_InferenceResponse response,
-        Pointer input0_data, Pointer input1_data,
-        String output0, String output1,
-        long expected_byte_size,
-        int expected_datatype, boolean is_int)
-    {
-      HashMap<String, BytePointer> output_data = new HashMap<>();
+    for (int idx = 0; idx < output_count[0]; ++idx) {
+      BytePointer cname = new BytePointer((Pointer) null);
+      IntPointer datatype = new IntPointer(1);
+      LongPointer shape = new LongPointer((Pointer) null);
+      LongPointer dim_count = new LongPointer(1);
+      Pointer base = new Pointer();
+      SizeTPointer byte_size = new SizeTPointer(1);
+      IntPointer memory_type = new IntPointer(1);
+      LongPointer memory_type_id = new LongPointer(1);
+      Pointer userp = new Pointer();
 
-      int[] output_count = {0};
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceResponseOutputCount(response, output_count),
-          "getting number of response outputs");
-      if (output_count[0] != 2) {
-        FAIL("expecting 2 response outputs, got " + output_count[0]);
-      }
-
-      for (int idx = 0; idx < output_count[0]; ++idx) {
-        BytePointer cname = new BytePointer((Pointer)null);
-        IntPointer datatype = new IntPointer(1);
-        LongPointer shape = new LongPointer((Pointer)null);
-        LongPointer dim_count = new LongPointer(1);
-        Pointer base = new Pointer();
-        SizeTPointer byte_size = new SizeTPointer(1);
-        IntPointer memory_type = new IntPointer(1);
-        LongPointer memory_type_id = new LongPointer(1);
-        Pointer userp = new Pointer();
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseOutput(
-                response, idx, cname, datatype, shape, dim_count, base,
-                byte_size, memory_type, memory_type_id, userp),
-            "getting output info");
-
-        if (cname.isNull()) {
-          FAIL("unable to get output name");
-        }
-
-        String name = cname.getString();
-        if ((!name.equals(output0)) && (!name.equals(output1))) {
-          FAIL("unexpected output '" + name + "'");
-        }
-
-        if ((dim_count.get() != 2) || (shape.get(0) != 1) || (shape.get(1) != 16)) {
-          FAIL("unexpected shape for '" + name + "'");
-        }
+          TRITONSERVER_InferenceResponseOutput(
+              response, idx, cname, datatype, shape, dim_count, base, byte_size,
+              memory_type, memory_type_id, userp),
+          "getting output info");
 
-        if (datatype.get() != expected_datatype) {
-          FAIL(
-              "unexpected datatype '" +
-              TRITONSERVER_DataTypeString(datatype.get()) + "' for '" +
-              name + "'");
-        }
-
-        if (byte_size.get() != expected_byte_size) {
-          FAIL(
-              "unexpected byte-size, expected " +
-              expected_byte_size + ", got " +
-              byte_size.get() + " for " + name);
-        }
-
-        if (enforce_memory_type && (memory_type.get() != requested_memory_type)) {
-          FAIL(
-              "unexpected memory type, expected to be allocated in " +
-              TRITONSERVER_MemoryTypeString(requested_memory_type) +
-              ", got " + TRITONSERVER_MemoryTypeString(memory_type.get()) +
-              ", id " + memory_type_id.get() + " for " + name);
-        }
+      if (cname.isNull()) {
+        FAIL("unable to get output name");
+      }
 
-        // We make a copy of the data here... which we could avoid for
-        // performance reasons but ok for this simple example.
-        BytePointer odata = new BytePointer(byte_size.get());
-        output_data.put(name, odata);
-        odata.put(base.limit(byte_size.get()));
+      String name = cname.getString();
+      if ((!name.equals(output0)) && (!name.equals(output1))) {
+        FAIL("unexpected output '" + name + "'");
       }
 
-      if (is_int) {
-        CompareResult(
-            output0, output1, new IntPointer(input0_data), new IntPointer(input1_data),
-            new IntPointer(output_data.get(output0)), new IntPointer(output_data.get(output1)));
-      } else {
-        CompareResult(
-            output0, output1, new FloatPointer(input0_data), new FloatPointer(input1_data),
-            new FloatPointer(output_data.get(output0)), new FloatPointer(output_data.get(output1)));
-      }
-    }
-
-    /**
-    Returns whether the memory growth is within the acceptable range
-    @param  max_float_allowed     Maximum allowed memory growth (%)
-    @param  max_mem_allowed       Maximum allowed memory (MB)
-     */
-    static boolean
-    ValidateMemoryGrowth(float max_growth_allowed, int max_mem_allowed){
-      // Allocate list starting capacity to hold up to 24 hours worth of snapshots.
-      List<Double> memory_snapshots = new ArrayList<Double>(20000);
-      while(!done){
-        try {
-          Thread.sleep(5000);
-        } catch (InterruptedException e){
-          System.out.println("Memory growth validation interrupted.");
-        }
-        System.gc();
-        double snapshot = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
-        memory_snapshots.add(snapshot);
-        System.out.println("Memory allocated (MB):" + snapshot/1E6);
+      if ((dim_count.get() != 2) || (shape.get(0) != 1)
+          || (shape.get(1) != 16)) {
+        FAIL("unexpected shape for '" + name + "'");
       }
-      if(memory_snapshots.size() < 5){
-        System.out.println("Error: Not enough snapshots, found " + memory_snapshots.size()
-        + " snapshots");
-        return false;
+
+      if (datatype.get() != expected_datatype) {
+        FAIL(
+            "unexpected datatype '"
+            + TRITONSERVER_DataTypeString(datatype.get()) + "' for '" + name
+            + "'");
       }
 
-      // Measure memory growth without outliers by taking difference
-      // between 90th percentile and 10th percentile memory usage.
-      final double bytes_in_mb = 1E6;
-      Collections.sort(memory_snapshots);
-      int index_max = ((int) Math.ceil(max_percentile / 100.0 * memory_snapshots.size())) - 1;
-      int index_min = ((int) Math.ceil(min_percentile / 100.0 * memory_snapshots.size())) - 1;
-      double memory_allocation_delta = memory_snapshots.get(index_max) - memory_snapshots.get(index_min);
-      double memory_allocation_delta_mb = memory_allocation_delta / bytes_in_mb;
-      double memory_allocation_delta_percent = memory_allocation_delta / memory_snapshots.get(index_max);
+      if (byte_size.get() != expected_byte_size) {
+        FAIL(
+            "unexpected byte-size, expected " + expected_byte_size + ", got "
+            + byte_size.get() + " for " + name);
+      }
 
-      System.out.println("Change in memory allocation (MB): " +
-          memory_allocation_delta_mb + ", " +
-          (memory_allocation_delta_percent * 100) + "%");
+      if (enforce_memory_type && (memory_type.get() != requested_memory_type)) {
+        FAIL(
+            "unexpected memory type, expected to be allocated in "
+            + TRITONSERVER_MemoryTypeString(requested_memory_type) + ", got "
+            + TRITONSERVER_MemoryTypeString(memory_type.get()) + ", id "
+            + memory_type_id.get() + " for " + name);
+      }
 
-      boolean passed = true;
+      // We make a copy of the data here... which we could avoid for
+      // performance reasons but ok for this simple example.
+      BytePointer odata = new BytePointer(byte_size.get());
+      output_data.put(name, odata);
+      odata.put(base.limit(byte_size.get()));
+    }
 
-      if(memory_allocation_delta_percent >= max_growth_allowed){
-        passed = false;
-        System.out.println("Exceeded allowed memory growth (" +
-          (max_growth_allowed * 100) + "%)");
+    if (is_int) {
+      CompareResult(
+          output0, output1, new IntPointer(input0_data),
+          new IntPointer(input1_data), new IntPointer(output_data.get(output0)),
+          new IntPointer(output_data.get(output1)));
+    } else {
+      CompareResult(
+          output0, output1, new FloatPointer(input0_data),
+          new FloatPointer(input1_data),
+          new FloatPointer(output_data.get(output0)),
+          new FloatPointer(output_data.get(output1)));
+    }
+  }
+
+  /**
+  Returns whether the memory growth is within the acceptable range
+  @param  max_growth_allowed    Maximum allowed memory growth (%)
+  @param  max_mem_allowed       Maximum allowed memory (MB)
+   */
+  static boolean ValidateMemoryGrowth(
+      float max_growth_allowed, int max_mem_allowed)
+  {
+    // Allocate list starting capacity to hold up to 24 hours worth of
+    // snapshots.
+    List<Double> memory_snapshots = new ArrayList<Double>(20000);
+    while (!done) {
+      try {
+        Thread.sleep(5000);
       }
-
-      if((memory_snapshots.get(index_max) / bytes_in_mb) >= max_mem_allowed){
-        passed = false;
-        System.out.println("Exceeded allowed memory (" + max_mem_allowed + 
-          "MB), got " + (memory_snapshots.get(index_max) / bytes_in_mb) + "MB");
+      catch (InterruptedException e) {
+        System.out.println("Memory growth validation interrupted.");
       }
-      return passed;
+      System.gc();
+      double snapshot = Runtime.getRuntime().totalMemory()
+          - Runtime.getRuntime().freeMemory();
+      memory_snapshots.add(snapshot);
+      System.out.println("Memory allocated (MB):" + snapshot / 1E6);
+    }
+    if (memory_snapshots.size() < 5) {
+      System.out.println(
+          "Error: Not enough snapshots, found " + memory_snapshots.size()
+          + " snapshots");
+      return false;
     }
 
-    static void
-    RunInference(TRITONSERVER_ServerDeleter server, String model_name, boolean[] is_int, boolean[] is_torch_model, boolean check_accuracy)
-    throws Exception
-    {
-      // Create the allocator that will be used to allocate buffers for
-      // the result tensors.
-      TRITONSERVER_ResponseAllocator allocator = new TRITONSERVER_ResponseAllocator(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_ResponseAllocatorNew(
-              allocator, responseAlloc, responseRelease, null /* start_fn */),
-          "creating response allocator");
-
-      // Inference
-      TRITONSERVER_InferenceRequest irequest = new TRITONSERVER_InferenceRequest(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestNew(
-              irequest, server, model_name, -1 /* model_version */),
-          "creating inference request");
+    // Measure memory growth without outliers by taking difference
+    // between 90th percentile and 10th percentile memory usage.
+    final double bytes_in_mb = 1E6;
+    Collections.sort(memory_snapshots);
+    int index_max =
+        ((int) Math.ceil(max_percentile / 100.0 * memory_snapshots.size())) - 1;
+    int index_min =
+        ((int) Math.ceil(min_percentile / 100.0 * memory_snapshots.size())) - 1;
+    double memory_allocation_delta =
+        memory_snapshots.get(index_max) - memory_snapshots.get(index_min);
+    double memory_allocation_delta_mb = memory_allocation_delta / bytes_in_mb;
+    double memory_allocation_delta_percent =
+        memory_allocation_delta / memory_snapshots.get(index_max);
+
+    System.out.println(
+        "Change in memory allocation (MB): " + memory_allocation_delta_mb + ", "
+        + (memory_allocation_delta_percent * 100) + "%");
+
+    boolean passed = true;
+
+    if (memory_allocation_delta_percent >= max_growth_allowed) {
+      passed = false;
+      System.out.println(
+          "Exceeded allowed memory growth (" + (max_growth_allowed * 100)
+          + "%)");
+    }
 
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
-          "setting ID for the request");
+    if ((memory_snapshots.get(index_max) / bytes_in_mb) >= max_mem_allowed) {
+      passed = false;
+      System.out.println(
+          "Exceeded allowed memory (" + max_mem_allowed + "MB), got "
+          + (memory_snapshots.get(index_max) / bytes_in_mb) + "MB");
+    }
+    return passed;
+  }
+
+  static void RunInference(
+      TRITONSERVER_ServerDeleter server, String model_name, boolean[] is_int,
+      boolean[] is_torch_model, boolean check_accuracy) throws Exception
+  {
+    // Create the allocator that will be used to allocate buffers for
+    // the result tensors.
+    TRITONSERVER_ResponseAllocator allocator =
+        new TRITONSERVER_ResponseAllocator(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ResponseAllocatorNew(
+            allocator, responseAlloc, responseRelease, null /* start_fn */),
+        "creating response allocator");
+
+    // Inference
+    TRITONSERVER_InferenceRequest irequest =
+        new TRITONSERVER_InferenceRequest(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestNew(
+            irequest, server, model_name, -1 /* model_version */),
+        "creating inference request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
+        "setting ID for the request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetReleaseCallback(
+            irequest, inferRequestComplete, null /* request_release_userp */),
+        "setting request release callback");
+
+    // Inputs
+    String input0 = is_torch_model[0] ? "INPUT__0" : "INPUT0";
+    String input1 = is_torch_model[0] ? "INPUT__1" : "INPUT1";
+
+    long[] input0_shape = {1, 16};
+    long[] input1_shape = {1, 16};
+
+    int datatype =
+        (is_int[0]) ? TRITONSERVER_TYPE_INT32 : TRITONSERVER_TYPE_FP32;
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddInput(
+            irequest, input0, datatype, input0_shape, input0_shape.length),
+        "setting input 0 meta-data for the request");
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddInput(
+            irequest, input1, datatype, input1_shape, input1_shape.length),
+        "setting input 1 meta-data for the request");
+
+    String output0 = is_torch_model[0] ? "OUTPUT__0" : "OUTPUT0";
+    String output1 = is_torch_model[0] ? "OUTPUT__1" : "OUTPUT1";
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output0),
+        "requesting output 0 for the request");
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output1),
+        "requesting output 1 for the request");
+
+    // Create the data for the two input tensors. Initialize the first
+    // to unique values and the second to all ones.
+    BytePointer input0_data;
+    BytePointer input1_data;
+    if (is_int[0]) {
+      IntPointer[] p0 = {null}, p1 = {null};
+      GenerateInputData(p0, p1);
+      input0_data = p0[0].getPointer(BytePointer.class);
+      input1_data = p1[0].getPointer(BytePointer.class);
+    } else {
+      FloatPointer[] p0 = {null}, p1 = {null};
+      GenerateInputData(p0, p1);
+      input0_data = p0[0].getPointer(BytePointer.class);
+      input1_data = p1[0].getPointer(BytePointer.class);
+    }
 
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetReleaseCallback(
-              irequest, inferRequestComplete, null /* request_release_userp */),
-          "setting request release callback");
+    long input0_size = input0_data.limit();
+    long input1_size = input1_data.limit();
 
-      // Inputs
-      String input0 = is_torch_model[0] ? "INPUT__0" : "INPUT0";
-      String input1 = is_torch_model[0] ? "INPUT__1" : "INPUT1";
+    Pointer input0_base = input0_data;
+    Pointer input1_base = input1_data;
 
-      long[] input0_shape = {1, 16};
-      long[] input1_shape = {1, 16};
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAppendInputData(
+            irequest, input0, input0_base, input0_size, requested_memory_type,
+            0 /* memory_type_id */),
+        "assigning INPUT0 data");
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAppendInputData(
+            irequest, input1, input1_base, input1_size, requested_memory_type,
+            0 /* memory_type_id */),
+        "assigning INPUT1 data");
 
-      int datatype =
-          (is_int[0]) ? TRITONSERVER_TYPE_INT32 : TRITONSERVER_TYPE_FP32;
+    // Perform inference...
+    {
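+      // The response callback completes this future; it is looked up in the
+      // 'futures' map keyed by the request pointer passed as userp.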
+      CompletableFuture<TRITONSERVER_InferenceResponse> completed =
+          new CompletableFuture<>();
+      futures.put(irequest, completed);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddInput(
-              irequest, input0, datatype, input0_shape, input0_shape.length),
-          "setting input 0 meta-data for the request");
+          TRITONSERVER_InferenceRequestSetResponseCallback(
+              irequest, allocator, null /* response_allocator_userp */,
+              inferResponseComplete, irequest),
+          "setting response callback");
+
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddInput(
-              irequest, input1, datatype, input1_shape, input1_shape.length),
-          "setting input 1 meta-data for the request");
+          TRITONSERVER_ServerInferAsync(server, irequest, null /* trace */),
+          "running inference");
 
-      String output0 = is_torch_model[0] ? "OUTPUT__0" : "OUTPUT0";
-      String output1 = is_torch_model[0] ? "OUTPUT__1" : "OUTPUT1";
+      // Wait for the inference to complete.
+      TRITONSERVER_InferenceResponse completed_response = completed.get();
+      futures.remove(irequest);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output0),
-          "requesting output 0 for the request");
+          TRITONSERVER_InferenceResponseError(completed_response),
+          "response status");
+      if (check_accuracy) {
+        Check(
+            completed_response, input0_data, input1_data, output0, output1,
+            input0_size, datatype, is_int[0]);
+      }
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output1),
-          "requesting output 1 for the request");
+          TRITONSERVER_InferenceResponseDelete(completed_response),
+          "deleting inference response");
+    }
 
-      // Create the data for the two input tensors. Initialize the first
-      // to unique values and the second to all ones.
-      BytePointer input0_data;
-      BytePointer input1_data;
+    // Modify some input data in place and then reuse the request
+    // object. For simplicity we only do this when the input tensors are
+    // in non-pinned system memory.
+    if (!enforce_memory_type
+        || (requested_memory_type == TRITONSERVER_MEMORY_CPU)) {
       if (is_int[0]) {
-        IntPointer[] p0 = {null}, p1 = {null};
-        GenerateInputData(p0, p1);
-        input0_data = p0[0].getPointer(BytePointer.class);
-        input1_data = p1[0].getPointer(BytePointer.class);
+        new IntPointer(input0_data).put(0, 27);
       } else {
-        FloatPointer[] p0 = {null}, p1 = {null};
-        GenerateInputData(p0, p1);
-        input0_data = p0[0].getPointer(BytePointer.class);
-        input1_data = p1[0].getPointer(BytePointer.class);
+        new FloatPointer(input0_data).put(0, 27.0f);
       }
 
-      long input0_size = input0_data.limit();
-      long input1_size = input1_data.limit();
+      CompletableFuture<TRITONSERVER_InferenceResponse> completed =
+          new CompletableFuture<>();
+      futures.put(irequest, completed);
 
-      Pointer input0_base = input0_data;
-      Pointer input1_base = input1_data;
+      // Using a new promise so have to re-register the callback to set
+      // the promise as the userp.
+      FAIL_IF_ERR(
+          TRITONSERVER_InferenceRequestSetResponseCallback(
+              irequest, allocator, null /* response_allocator_userp */,
+              inferResponseComplete, irequest),
+          "setting response callback");
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAppendInputData(
-              irequest, input0, input0_base, input0_size, requested_memory_type,
-              0 /* memory_type_id */),
-          "assigning INPUT0 data");
+          TRITONSERVER_ServerInferAsync(server, irequest, null /* trace */),
+          "running inference");
+
+      // Wait for the inference to complete.
+      TRITONSERVER_InferenceResponse completed_response = completed.get();
+      futures.remove(irequest);
+      FAIL_IF_ERR(
+          TRITONSERVER_InferenceResponseError(completed_response),
+          "response status");
+      if (check_accuracy) {
+        Check(
+            completed_response, input0_data, input1_data, output0, output1,
+            input0_size, datatype, is_int[0]);
+      }
+
+      FAIL_IF_ERR(
+          TRITONSERVER_InferenceResponseDelete(completed_response),
+          "deleting inference response");
+    }
+
+    // Remove input data and then add back different data.
+    {
+      FAIL_IF_ERR(
+          TRITONSERVER_InferenceRequestRemoveAllInputData(irequest, input0),
+          "removing INPUT0 data");
       FAIL_IF_ERR(
           TRITONSERVER_InferenceRequestAppendInputData(
-              irequest, input1, input1_base, input1_size, requested_memory_type,
+              irequest, input0, input1_base, input1_size, requested_memory_type,
               0 /* memory_type_id */),
-          "assigning INPUT1 data");
-
-      // Perform inference...
-      {
-        CompletableFuture<TRITONSERVER_InferenceResponse> completed = new CompletableFuture<>();
-        futures.put(irequest, completed);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestSetResponseCallback(
-                irequest, allocator, null /* response_allocator_userp */,
-                inferResponseComplete, irequest),
-            "setting response callback");
-
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerInferAsync(
-                server, irequest, null /* trace */),
-            "running inference");
-
-        // Wait for the inference to complete.
-        TRITONSERVER_InferenceResponse completed_response = completed.get();
-        futures.remove(irequest);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseError(completed_response),
-            "response status");
-        if (check_accuracy) {
-          Check(
-              completed_response, input0_data, input1_data, output0, output1,
-              input0_size, datatype, is_int[0]);
-        }
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseDelete(completed_response),
-            "deleting inference response");
-      }
-
-      // Modify some input data in place and then reuse the request
-      // object. For simplicity we only do this when the input tensors are
-      // in non-pinned system memory.
-      if (!enforce_memory_type ||
-          (requested_memory_type == TRITONSERVER_MEMORY_CPU)) {
-        if (is_int[0]) {
-          new IntPointer(input0_data).put(0, 27);
-        } else {
-          new FloatPointer(input0_data).put(0, 27.0f);
-        }
+          "assigning INPUT1 data to INPUT0");
 
-        CompletableFuture<TRITONSERVER_InferenceResponse> completed = new CompletableFuture<>();
-        futures.put(irequest, completed);
-
-        // Using a new promise so have to re-register the callback to set
-        // the promise as the userp.
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestSetResponseCallback(
-                irequest, allocator, null /* response_allocator_userp */,
-                inferResponseComplete, irequest),
-            "setting response callback");
-
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerInferAsync(
-                server, irequest, null /* trace */),
-            "running inference");
-
-        // Wait for the inference to complete.
-        TRITONSERVER_InferenceResponse completed_response = completed.get();
-        futures.remove(irequest);
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseError(completed_response),
-            "response status");
-        if (check_accuracy) {
-          Check(
-              completed_response, input0_data, input1_data, output0, output1,
-              input0_size, datatype, is_int[0]);
-        }
+      CompletableFuture<TRITONSERVER_InferenceResponse> completed =
+          new CompletableFuture<>();
+      futures.put(irequest, completed);
 
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseDelete(completed_response),
-            "deleting inference response");
-      }
-
-      // Remove input data and then add back different data.
-      {
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestRemoveAllInputData(irequest, input0),
-            "removing INPUT0 data");
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestAppendInputData(
-                irequest, input0, input1_base, input1_size, requested_memory_type,
-                0 /* memory_type_id */),
-            "assigning INPUT1 data to INPUT0");
-
-        CompletableFuture<TRITONSERVER_InferenceResponse> completed = new CompletableFuture<>();
-        futures.put(irequest, completed);
-
-        // Using a new promise so have to re-register the callback to set
-        // the promise as the userp.
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestSetResponseCallback(
-                irequest, allocator, null /* response_allocator_userp */,
-                inferResponseComplete, irequest),
-            "setting response callback");
-
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerInferAsync(
-                server, irequest, null /* trace */),
-            "running inference");
-
-        // Wait for the inference to complete.
-        TRITONSERVER_InferenceResponse completed_response = completed.get();
-        futures.remove(irequest);
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseError(completed_response),
-            "response status");
-
-        if (check_accuracy) {
-          // Both inputs are using input1_data...
-          Check(
-              completed_response, input1_data, input1_data, output0, output1,
-              input0_size, datatype, is_int[0]);
-        }
+      // Using a new promise so have to re-register the callback to set
+      // the promise as the userp.
+      FAIL_IF_ERR(
+          TRITONSERVER_InferenceRequestSetResponseCallback(
+              irequest, allocator, null /* response_allocator_userp */,
+              inferResponseComplete, irequest),
+          "setting response callback");
 
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseDelete(completed_response),
-            "deleting inference response");
-      }
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerInferAsync(server, irequest, null /* trace */),
+          "running inference");
 
+      // Wait for the inference to complete.
+      TRITONSERVER_InferenceResponse completed_response = completed.get();
+      futures.remove(irequest);
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestDelete(irequest),
-          "deleting inference request");
+          TRITONSERVER_InferenceResponseError(completed_response),
+          "response status");
+
+      if (check_accuracy) {
+        // Both inputs are using input1_data...
+        Check(
+            completed_response, input1_data, input1_data, output0, output1,
+            input0_size, datatype, is_int[0]);
+      }
 
       FAIL_IF_ERR(
-          TRITONSERVER_ResponseAllocatorDelete(allocator),
-          "deleting response allocator");
+          TRITONSERVER_InferenceResponseDelete(completed_response),
+          "deleting inference response");
     }
 
-    public static void
-    main(String[] args) throws Exception
-    {
-      int num_iterations = 1000000;
-      String model_repository_path = null;
-      int verbose_level = 0;
-      boolean check_accuracy = false;
-
-      // Parse commandline...
-      for (int i = 0; i < args.length; i++) {
-        switch (args[i]) {
-          case "-i":
-            i++;
-            try {
-              num_iterations = Integer.parseInt(args[i]);
-            } catch (NumberFormatException e){
-              Usage(
-                  "-i must be used to specify number of iterations");
-            }
-            break;
-          case "-m":
-            enforce_memory_type = true;
-            i++;
-            if (args[i].equals("system")) {
-              requested_memory_type = TRITONSERVER_MEMORY_CPU;
-            } else if (args[i].equals("pinned")) {
-              requested_memory_type = TRITONSERVER_MEMORY_CPU_PINNED;
-            } else if (args[i].equals("gpu")) {
-              requested_memory_type = TRITONSERVER_MEMORY_GPU;
-            } else {
-              Usage(
-                  "-m must be used to specify one of the following types:" +
-                  " <\"system\"|\"pinned\"|gpu>");
-            }
-            break;
-          case "-r":
-            model_repository_path = args[++i];
-            break;
-          case "-v":
-            verbose_level = 1;
-            break;
-          case "-c":
-            check_accuracy = true;
-            break;
-          case "-?":
-            Usage(null);
-            break;
-          case "--max-growth":
-            i++;
-            try {
-              max_growth_allowed = Integer.parseInt(args[i]) / 100.0f;
-            } catch (NumberFormatException e){
-              Usage(
-                  "--max-growth must be an integer value specifying allowed memory growth (%)");
-            }
-            break;
-          case "--max-memory":
-            i++;
-            try {
-              max_mem_allowed = Integer.parseInt(args[i]);
-            } catch (NumberFormatException e){
-              Usage(
-                  "--max-memory must be an integer value specifying maximum allowed memory (MB)");
-            }
-            break;
-        }
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestDelete(irequest),
+        "deleting inference request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_ResponseAllocatorDelete(allocator),
+        "deleting response allocator");
+  }
+
+  public static void main(String[] args) throws Exception
+  {
+    int num_iterations = 1000000;
+    String model_repository_path = null;
+    int verbose_level = 0;
+    boolean check_accuracy = false;
+
+    // Parse commandline...
+    for (int i = 0; i < args.length; i++) {
+      switch (args[i]) {
+        case "-i":
+          i++;
+          try {
+            num_iterations = Integer.parseInt(args[i]);
+          }
+          catch (NumberFormatException e) {
+            Usage("-i must be used to specify number of iterations");
+          }
+          break;
+        case "-m":
+          enforce_memory_type = true;
+          i++;
+          if (args[i].equals("system")) {
+            requested_memory_type = TRITONSERVER_MEMORY_CPU;
+          } else if (args[i].equals("pinned")) {
+            requested_memory_type = TRITONSERVER_MEMORY_CPU_PINNED;
+          } else if (args[i].equals("gpu")) {
+            requested_memory_type = TRITONSERVER_MEMORY_GPU;
+          } else {
+            Usage(
+                "-m must be used to specify one of the following types:"
+                + " <\"system\"|\"pinned\"|gpu>");
+          }
+          break;
+        case "-r":
+          model_repository_path = args[++i];
+          break;
+        case "-v":
+          verbose_level = 1;
+          break;
+        case "-c":
+          check_accuracy = true;
+          break;
+        case "-?":
+          Usage(null);
+          break;
+        case "--max-growth":
+          i++;
+          try {
+            max_growth_allowed = Integer.parseInt(args[i]) / 100.0f;
+          }
+          catch (NumberFormatException e) {
+            Usage(
+                "--max-growth must be an integer value specifying allowed memory growth (%)");
+          }
+          break;
+        case "--max-memory":
+          i++;
+          try {
+            max_mem_allowed = Integer.parseInt(args[i]);
+          }
+          catch (NumberFormatException e) {
+            Usage(
+                "--max-memory must be an integer value specifying maximum allowed memory (MB)");
+          }
+          break;
       }
+    }
 
-      if (model_repository_path == null) {
-        Usage("-r must be used to specify model repository path");
-      }
-      if (enforce_memory_type && requested_memory_type != TRITONSERVER_MEMORY_CPU) {
-        Usage("-m can only be set to \"system\" without enabling GPU");
-      }
+    if (model_repository_path == null) {
+      Usage("-r must be used to specify model repository path");
+    }
+    if (enforce_memory_type
+        && requested_memory_type != TRITONSERVER_MEMORY_CPU) {
+      Usage("-m can only be set to \"system\" without enabling GPU");
+    }
 
-      // Check API version.
-      int[] api_version_major = {0}, api_version_minor = {0};
-      FAIL_IF_ERR(
-          TRITONSERVER_ApiVersion(api_version_major, api_version_minor),
-          "getting Triton API version");
-      if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major[0]) ||
-          (TRITONSERVER_API_VERSION_MINOR > api_version_minor[0])) {
-        FAIL("triton server API version mismatch");
-      }
+    // Check API version.
+    int[] api_version_major = {0}, api_version_minor = {0};
+    FAIL_IF_ERR(
+        TRITONSERVER_ApiVersion(api_version_major, api_version_minor),
+        "getting Triton API version");
+    if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major[0])
+        || (TRITONSERVER_API_VERSION_MINOR > api_version_minor[0])) {
+      FAIL("triton server API version mismatch");
+    }
 
-      // Create the server...
-      TRITONSERVER_ServerOptions server_options = new TRITONSERVER_ServerOptions(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsNew(server_options),
-          "creating server options");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetModelRepositoryPath(
-              server_options, model_repository_path),
-          "setting model repository path");
+    // Create the server...
+    TRITONSERVER_ServerOptions server_options =
+        new TRITONSERVER_ServerOptions(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsNew(server_options),
+        "creating server options");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetModelRepositoryPath(
+            server_options, model_repository_path),
+        "setting model repository path");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
+        "setting verbose logging level");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetBackendDirectory(
+            server_options, "/opt/tritonserver/backends"),
+        "setting backend directory");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
+            server_options, "/opt/tritonserver/repoagents"),
+        "setting repository agent directory");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
+        "setting strict model configuration");
+    double min_compute_capability = TRITON_MIN_COMPUTE_CAPABILITY;
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetMinSupportedComputeCapability(
+            server_options, min_compute_capability),
+        "setting minimum supported CUDA compute capability");
+
+    TRITONSERVER_Server server_ptr = new TRITONSERVER_Server(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerNew(server_ptr, server_options), "creating server");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsDelete(server_options),
+        "deleting server options");
+
+    TRITONSERVER_ServerDeleter server =
+        new TRITONSERVER_ServerDeleter(server_ptr);
+
+    // Wait until the server is both live and ready.
+    int health_iters = 0;
+    while (true) {
+      boolean[] live = {false}, ready = {false};
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
-          "setting verbose logging level");
+          TRITONSERVER_ServerIsLive(server, live),
+          "unable to get server liveness");
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetBackendDirectory(
-              server_options, "/opt/tritonserver/backends"),
-          "setting backend directory");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
-              server_options, "/opt/tritonserver/repoagents"),
-          "setting repository agent directory");
+          TRITONSERVER_ServerIsReady(server, ready),
+          "unable to get server readiness");
+      System.out.println(
+          "Server Health: live " + live[0] + ", ready " + ready[0]);
+      if (live[0] && ready[0]) {
+        break;
+      }
+
+      if (++health_iters >= 10) {
+        FAIL("failed to find healthy inference server");
+      }
+
+      Thread.sleep(500);
+    }
+
+    // Print status of the server.
+    {
+      TRITONSERVER_Message server_metadata_message =
+          new TRITONSERVER_Message(null);
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
-          "setting strict model configuration");
-      double min_compute_capability = TRITON_MIN_COMPUTE_CAPABILITY;
+          TRITONSERVER_ServerMetadata(server, server_metadata_message),
+          "unable to get server metadata message");
+      BytePointer buffer = new BytePointer((Pointer) null);
+      SizeTPointer byte_size = new SizeTPointer(1);
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetMinSupportedComputeCapability(
-              server_options, min_compute_capability),
-          "setting minimum supported CUDA compute capability");
+          TRITONSERVER_MessageSerializeToJson(
+              server_metadata_message, buffer, byte_size),
+          "unable to serialize server metadata message");
+
+      System.out.println("Server Status:");
+      System.out.println(buffer.limit(byte_size.get()).getString());
 
-      TRITONSERVER_Server server_ptr = new TRITONSERVER_Server(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerNew(server_ptr, server_options), "creating server");
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsDelete(server_options),
-          "deleting server options");
-
-      TRITONSERVER_ServerDeleter server = new TRITONSERVER_ServerDeleter(server_ptr);
-
-      // Wait until the server is both live and ready.
-      int health_iters = 0;
-      while (true) {
-        boolean[] live = {false}, ready = {false};
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerIsLive(server, live),
-            "unable to get server liveness");
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerIsReady(server, ready),
-            "unable to get server readiness");
-        System.out.println("Server Health: live " + live[0] + ", ready " + ready[0]);
-        if (live[0] && ready[0]) {
-          break;
-        }
+          TRITONSERVER_MessageDelete(server_metadata_message),
+          "deleting status metadata");
+    }
+
+    String model_name = "simple";
 
+    // Wait for the model to become available.
+    boolean[] is_torch_model = {false};
+    boolean[] is_int = {true};
+    boolean[] is_ready = {false};
+    health_iters = 0;
+    while (!is_ready[0]) {
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerModelIsReady(server, model_name, 1, is_ready),
+          "unable to get model readiness");
+      if (!is_ready[0]) {
         if (++health_iters >= 10) {
-          FAIL("failed to find healthy inference server");
+          FAIL("model failed to be ready in 10 iterations");
         }
-
         Thread.sleep(500);
+        continue;
       }
 
-      // Print status of the server.
-      {
-        TRITONSERVER_Message server_metadata_message = new TRITONSERVER_Message(null);
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerMetadata(server, server_metadata_message),
-            "unable to get server metadata message");
-        BytePointer buffer = new BytePointer((Pointer)null);
-        SizeTPointer byte_size = new SizeTPointer(1);
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageSerializeToJson(
-                server_metadata_message, buffer, byte_size),
-            "unable to serialize server metadata message");
-
-        System.out.println("Server Status:");
-        System.out.println(buffer.limit(byte_size.get()).getString());
-
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageDelete(server_metadata_message),
-            "deleting status metadata");
-      }
-
-      String model_name = "simple";
-
-      // Wait for the model to become available.
-      boolean[] is_torch_model = {false};
-      boolean[] is_int = {true};
-      boolean[] is_ready = {false};
-      health_iters = 0;
-      while (!is_ready[0]) {
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerModelIsReady(
-                server, model_name, 1, is_ready),
-            "unable to get model readiness");
-        if (!is_ready[0]) {
-          if (++health_iters >= 10) {
-            FAIL("model failed to be ready in 10 iterations");
-          }
-          Thread.sleep(500);
-          continue;
-        }
-
-        TRITONSERVER_Message model_metadata_message = new TRITONSERVER_Message(null);
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerModelMetadata(
-                server, model_name, 1, model_metadata_message),
-            "unable to get model metadata message");
-        BytePointer buffer = new BytePointer((Pointer)null);
-        SizeTPointer byte_size = new SizeTPointer(1);
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageSerializeToJson(
-                model_metadata_message, buffer, byte_size),
-            "unable to serialize model status protobuf");
-
-        JsonParser parser = new JsonParser();
-        JsonObject model_metadata = null;
-        try {
-          model_metadata = parser.parse(buffer.limit(byte_size.get()).getString()).getAsJsonObject();
-        } catch (Exception e) {
-          FAIL("error: failed to parse model metadata from JSON: " + e);
-        }
+      TRITONSERVER_Message model_metadata_message =
+          new TRITONSERVER_Message(null);
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerModelMetadata(
+              server, model_name, 1, model_metadata_message),
+          "unable to get model metadata message");
+      BytePointer buffer = new BytePointer((Pointer) null);
+      SizeTPointer byte_size = new SizeTPointer(1);
+      FAIL_IF_ERR(
+          TRITONSERVER_MessageSerializeToJson(
+              model_metadata_message, buffer, byte_size),
+          "unable to serialize model status protobuf");
+
+      JsonParser parser = new JsonParser();
+      JsonObject model_metadata = null;
+      try {
+        model_metadata = parser.parse(buffer.limit(byte_size.get()).getString())
+                             .getAsJsonObject();
+      }
+      catch (Exception e) {
+        FAIL("error: failed to parse model metadata from JSON: " + e);
+      }
 
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageDelete(model_metadata_message),
-            "deleting status protobuf");
+      FAIL_IF_ERR(
+          TRITONSERVER_MessageDelete(model_metadata_message),
+          "deleting status protobuf");
 
-        if (!model_metadata.get("name").getAsString().equals(model_name)) {
-          FAIL("unable to find metadata for model");
-        }
+      if (!model_metadata.get("name").getAsString().equals(model_name)) {
+        FAIL("unable to find metadata for model");
+      }
 
-        boolean found_version = false;
-        if (model_metadata.has("versions")) {
-          for (JsonElement version : model_metadata.get("versions").getAsJsonArray()) {
-            if (version.getAsString().equals("1")) {
-              found_version = true;
-              break;
-            }
+      boolean found_version = false;
+      if (model_metadata.has("versions")) {
+        for (JsonElement version :
+             model_metadata.get("versions").getAsJsonArray()) {
+          if (version.getAsString().equals("1")) {
+            found_version = true;
+            break;
           }
         }
-        if (!found_version) {
-          FAIL("unable to find version 1 status for model");
-        }
-
-        FAIL_IF_ERR(
-            ParseModelMetadata(model_metadata, is_int, is_torch_model),
-            "parsing model metadata");
+      }
+      if (!found_version) {
+        FAIL("unable to find version 1 status for model");
       }
 
-      Runnable runnable =
-        () -> {
-          boolean passed = ValidateMemoryGrowth(max_growth_allowed, max_mem_allowed);
-          
-          // Sleep to give the garbage collector time to free the server.
-          // This avoids race conditions between Triton bindings' printing and
-          // Java's native printing below.
-          try {
-            Thread.sleep(5000);
-          } catch (InterruptedException e){
-            System.out.println("Sleep interrupted: " + e.toString());
-          }
-
-          if(passed){
-            System.out.println("Memory growth test passed");
-          } else {
-            System.out.println("Memory growth test FAILED");
-          }
-        };
-      Thread memory_thread = new Thread(runnable);
-      memory_thread.start();
+      FAIL_IF_ERR(
+          ParseModelMetadata(model_metadata, is_int, is_torch_model),
+          "parsing model metadata");
+    }
 
-      for(int i = 0; i < num_iterations; i++){
-        try (PointerScope scope = new PointerScope()) {
-          RunInference(server, model_name, is_int, is_torch_model, check_accuracy);
-        }
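+    // Validate memory growth on a background thread; it samples heap usage
+    // until the main thread sets 'done' after the inference loop below.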
+    Runnable runnable = () ->
+    {
+      boolean passed =
+          ValidateMemoryGrowth(max_growth_allowed, max_mem_allowed);
+
+      // Sleep to give the garbage collector time to free the server.
+      // This avoids race conditions between Triton bindings' printing and
+      // Java's native printing below.
+      try {
+        Thread.sleep(5000);
+      }
+      catch (InterruptedException e) {
+        System.out.println("Sleep interrupted: " + e.toString());
       }
-      done = true;
-      memory_thread.join();
 
-      System.exit(0);
+      if (passed) {
+        System.out.println("Memory growth test passed");
+      } else {
+        System.out.println("Memory growth test FAILED");
+      }
+    };
+    Thread memory_thread = new Thread(runnable);
+    memory_thread.start();
+
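+    // Each iteration runs in a PointerScope so JavaCPP-managed native
+    // allocations from that iteration are released when the scope closes.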
+    for (int i = 0; i < num_iterations; i++) {
+      try (PointerScope scope = new PointerScope()) {
+        RunInference(
+            server, model_name, is_int, is_torch_model, check_accuracy);
+      }
     }
+    done = true;
+    memory_thread.join();
+
+    System.exit(0);
+  }
 }
diff --git a/qa/L0_java_memory_growth/test.sh b/qa/L0_java_memory_growth/test.sh
index 60bacb9b94..d5ec33a5d5 100755
--- a/qa/L0_java_memory_growth/test.sh
+++ b/qa/L0_java_memory_growth/test.sh
@@ -27,14 +27,12 @@
 
 # Set up test files based on installation instructions
 # https://github.com/bytedeco/javacpp-presets/blob/master/tritonserver/README.md
-set +e
-rm -r javacpp-presets
-git clone https://github.com/bytedeco/javacpp-presets.git
-cd javacpp-presets
-mvn clean install --projects .,tritonserver
-mvn clean install -f platform --projects ../tritonserver/platform -Djavacpp.platform.host
-cd ..
+JAVACPP_BRANCH=${JAVACPP_BRANCH:="https://github.com/bytedeco/javacpp-presets.git"}
+JAVACPP_BRANCH_TAG=${JAVACPP_BRANCH_TAG:="master"}
 set -e
+git clone --single-branch --depth=1 -b ${TRITON_CLIENT_REPO_TAG} https://github.com/triton-inference-server/client.git
+source client/src/java-api-bindings/scripts/install_dependencies_and_build.sh -b $PWD --javacpp-branch ${JAVACPP_BRANCH} --javacpp-tag ${JAVACPP_BRANCH_TAG} --keep-build-dependencies
+cd ..
 
 export MAVEN_OPTS="-XX:MaxGCPauseMillis=40"
 MODEL_REPO=`pwd`/models
@@ -76,12 +74,12 @@ fi
 LOG_IDX=$((LOG_IDX+1))
 CLIENT_LOG="./client_$LOG_IDX.log"
 
-# Longer-running memory growth test 
+# Longer-running memory growth test
 ITERS=1000000
 MAX_MEM_GROWTH_MB=10
 if [ "$TRITON_PERF_LONG" == 1 ]; then
     # ~1 day
-    ITERS=125000000
+    ITERS=150000000
     MAX_MEM_GROWTH_MB=25
 fi
 
diff --git a/qa/L0_java_resnet/ResnetTest.java b/qa/L0_java_resnet/ResnetTest.java
index 9bf46b22f7..4827273926 100644
--- a/qa/L0_java_resnet/ResnetTest.java
+++ b/qa/L0_java_resnet/ResnetTest.java
@@ -1,4 +1,4 @@
-// Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 //
 // Redistribution and use in source and binary forms, with or without
 // modification, are permitted provided that the following conditions
@@ -24,593 +24,616 @@
 // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
+import static org.bytedeco.tritonserver.global.tritonserver.*;
+
+import com.google.gson.*;
 import java.io.*;
 import java.util.*;
 import java.util.concurrent.*;
-import com.google.gson.*;
 import org.bytedeco.javacpp.*;
 import org.bytedeco.tritonserver.tritonserver.*;
-import static org.bytedeco.tritonserver.global.tritonserver.*;
 
 public class ResnetTest {
-    // Maximum allowed difference from expected model outputs
-    private static final float ALLOWED_DELTA = .001f;
-    private static final String[] MODELS = {
-      "resnet50_fp32_libtorch",
-      "resnet50_fp32_onnx",
+  // Maximum allowed difference from expected model outputs
+  private static final float ALLOWED_DELTA = .001f;
+  private static final String[] MODELS = {
+      "resnet50_fp32_libtorch", "resnet50_fp32_onnx",
       // TODO: fix build to support GPU only resnet50v1.5_fp16_savedmodel
       //"resnet50v1.5_fp16_savedmodel",
-      };
-    private static final double TRITON_MIN_COMPUTE_CAPABILITY = 6.0;
-    private enum Backend {
-      NONE,
-      ONNX,
-      TF,
-      TORCH,
+  };
+  private static final double TRITON_MIN_COMPUTE_CAPABILITY = 6.0;
+  private enum Backend {
+    NONE,
+    ONNX,
+    TF,
+    TORCH,
+  }
+
+  static void FAIL(String MSG)
+  {
+    System.err.println("failure: " + MSG);
+    System.exit(1);
+  }
+
+  static void FAIL_IF_ERR(TRITONSERVER_Error err__, String MSG)
+  {
+    if (err__ != null) {
+      System.err.println(
+          "error: " + MSG + ":" + TRITONSERVER_ErrorCodeString(err__) + " - "
+          + TRITONSERVER_ErrorMessage(err__));
+      TRITONSERVER_ErrorDelete(err__);
+      System.exit(1);
     }
+  }
 
-    static void FAIL(String MSG) {
-        System.err.println("failure: " + MSG);
-        System.exit(1);
-    }
+  static boolean enforce_memory_type = false;
+  static int requested_memory_type;
 
-    static void FAIL_IF_ERR(TRITONSERVER_Error err__, String MSG) {
-        if (err__ != null) {
-            System.err.println("error: " + MSG + ":"
-                             + TRITONSERVER_ErrorCodeString(err__) + " - "
-                             + TRITONSERVER_ErrorMessage(err__));
-            TRITONSERVER_ErrorDelete(err__);
-            System.exit(1);
-        }
+  static class TRITONSERVER_ServerDeleter extends TRITONSERVER_Server {
+    public TRITONSERVER_ServerDeleter(TRITONSERVER_Server p)
+    {
+      super(p);
+      deallocator(new DeleteDeallocator(this));
     }
+    protected static class DeleteDeallocator
+        extends TRITONSERVER_Server implements Deallocator {
+      DeleteDeallocator(Pointer p) { super(p); }
+      @Override public void deallocate() { TRITONSERVER_ServerDelete(this); }
+    }
+  }
 
-    static boolean enforce_memory_type = false;
-    static int requested_memory_type;
-
-    static class TRITONSERVER_ServerDeleter extends TRITONSERVER_Server {
-        public TRITONSERVER_ServerDeleter(TRITONSERVER_Server p) { super(p); deallocator(new DeleteDeallocator(this)); }
-        protected static class DeleteDeallocator extends TRITONSERVER_Server implements Deallocator {
-            DeleteDeallocator(Pointer p) { super(p); }
-            @Override public void deallocate() { TRITONSERVER_ServerDelete(this); }
-        }
+  static void Usage(String msg)
+  {
+    if (msg != null) {
+      System.err.println(msg);
     }
 
-    static void
-    Usage(String msg)
+    System.err.println(
+        "Usage: java " + ResnetTest.class.getSimpleName() + " [options]");
+    System.err.println(
+        "\t-m <\"system\"|\"pinned\"|gpu>"
+        + " Enforce the memory type for input and output tensors."
+        + " If not specified, inputs will be in system memory and outputs"
+        + " will be based on the model's preferred type.");
+    System.err.println("\t-v Enable verbose logging");
+    System.err.println("\t-r [model repository absolute path]");
+
+    System.exit(1);
+  }
+
+  static class ResponseAlloc extends TRITONSERVER_ResponseAllocatorAllocFn_t {
+    @Override
+    public TRITONSERVER_Error call(
+        TRITONSERVER_ResponseAllocator allocator, String tensor_name,
+        long byte_size, int preferred_memory_type,
+        long preferred_memory_type_id, Pointer userp, PointerPointer buffer,
+        PointerPointer buffer_userp, IntPointer actual_memory_type,
+        LongPointer actual_memory_type_id)
     {
-      if (msg != null) {
-        System.err.println(msg);
-      }
+      // Initially attempt to make the actual memory type and id that we
+      // allocate be the same as preferred memory type
+      actual_memory_type.put(0, preferred_memory_type);
+      actual_memory_type_id.put(0, preferred_memory_type_id);
+
+      // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
+      // need to do any other book-keeping.
+      if (byte_size == 0) {
+        buffer.put(0, null);
+        buffer_userp.put(0, null);
+        System.out.println(
+            "allocated " + byte_size + " bytes for result tensor "
+            + tensor_name);
+      } else {
+        Pointer allocated_ptr = new Pointer();
+        if (enforce_memory_type) {
+          actual_memory_type.put(0, requested_memory_type);
+        }
 
-      System.err.println("Usage: java " + ResnetTest.class.getSimpleName() + " [options]");
-      System.err.println("\t-m <\"system\"|\"pinned\"|gpu>"
-                       + " Enforce the memory type for input and output tensors."
-                       + " If not specified, inputs will be in system memory and outputs"
-                       + " will be based on the model's preferred type.");
-      System.err.println("\t-v Enable verbose logging");
-      System.err.println("\t-r [model repository absolute path]");
+        actual_memory_type.put(0, TRITONSERVER_MEMORY_CPU);
+        allocated_ptr = Pointer.malloc(byte_size);
+
+        // Pass the tensor name with buffer_userp so we can show it when
+        // releasing the buffer.
+        if (!allocated_ptr.isNull()) {
+          buffer.put(0, allocated_ptr);
+          buffer_userp.put(0, Loader.newGlobalRef(tensor_name));
+          System.out.println(
+              "allocated " + byte_size + " bytes in "
+              + TRITONSERVER_MemoryTypeString(actual_memory_type.get())
+              + " for result tensor " + tensor_name);
+        }
+      }
 
-      System.exit(1);
+      return null; // Success
     }
+  }
+
+  static class ResponseRelease
+      extends TRITONSERVER_ResponseAllocatorReleaseFn_t {
+    @Override
+    public TRITONSERVER_Error call(
+        TRITONSERVER_ResponseAllocator allocator, Pointer buffer,
+        Pointer buffer_userp, long byte_size, int memory_type,
+        long memory_type_id)
+    {
+      String name = null;
+      if (buffer_userp != null) {
+        name = (String) Loader.accessGlobalRef(buffer_userp);
+      } else {
+        name = "";
+      }
 
-    static class ResponseAlloc extends TRITONSERVER_ResponseAllocatorAllocFn_t {
-        @Override public TRITONSERVER_Error call (
-            TRITONSERVER_ResponseAllocator allocator, String tensor_name,
-            long byte_size, int preferred_memory_type,
-            long preferred_memory_type_id, Pointer userp, PointerPointer buffer,
-            PointerPointer buffer_userp, IntPointer actual_memory_type,
-            LongPointer actual_memory_type_id)
-        {
-          // Initially attempt to make the actual memory type and id that we
-          // allocate be the same as preferred memory type
-          actual_memory_type.put(0, preferred_memory_type);
-          actual_memory_type_id.put(0, preferred_memory_type_id);
-
-          // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
-          // need to do any other book-keeping.
-          if (byte_size == 0) {
-            buffer.put(0, null);
-            buffer_userp.put(0, null);
-            System.out.println("allocated " + byte_size + " bytes for result tensor " + tensor_name);
-          } else {
-            Pointer allocated_ptr = new Pointer();
-            if (enforce_memory_type) {
-              actual_memory_type.put(0, requested_memory_type);
-            }
-
-            actual_memory_type.put(0, TRITONSERVER_MEMORY_CPU);
-            allocated_ptr = Pointer.malloc(byte_size);
-
-            // Pass the tensor name with buffer_userp so we can show it when
-            // releasing the buffer.
-            if (!allocated_ptr.isNull()) {
-              buffer.put(0, allocated_ptr);
-              buffer_userp.put(0, Loader.newGlobalRef(tensor_name));
-              System.out.println("allocated " + byte_size + " bytes in "
-                               + TRITONSERVER_MemoryTypeString(actual_memory_type.get())
-                               + " for result tensor " + tensor_name);
-            }
-          }
+      Pointer.free(buffer);
+      Loader.deleteGlobalRef(buffer_userp);
 
-          return null;  // Success
-        }
+      return null; // Success
     }
+  }
 
-    static class ResponseRelease extends TRITONSERVER_ResponseAllocatorReleaseFn_t {
-        @Override public TRITONSERVER_Error call (
-            TRITONSERVER_ResponseAllocator allocator, Pointer buffer, Pointer buffer_userp,
-            long byte_size, int memory_type, long memory_type_id)
-        {
-          String name = null;
-          if (buffer_userp != null) {
-            name = (String)Loader.accessGlobalRef(buffer_userp);
-          } else {
-            name = "";
-          }
-          
-          Pointer.free(buffer);
-          Loader.deleteGlobalRef(buffer_userp);
-
-          return null;  // Success
-        }
+  static class InferRequestComplete
+      extends TRITONSERVER_InferenceRequestReleaseFn_t {
+    @Override
+    public void call(
+        TRITONSERVER_InferenceRequest request, int flags, Pointer userp)
+    {
+      // We reuse the request so we don't delete it here.
     }
+  }
 
-    static class InferRequestComplete extends TRITONSERVER_InferenceRequestReleaseFn_t {
-        @Override public void call (
-            TRITONSERVER_InferenceRequest request, int flags, Pointer userp)
-        {
-          // We reuse the request so we don't delete it here.
-        }
+  static class InferResponseComplete
+      extends TRITONSERVER_InferenceResponseCompleteFn_t {
+    @Override
+    public void call(
+        TRITONSERVER_InferenceResponse response, int flags, Pointer userp)
+    {
+      if (response != null) {
+        // Send 'response' to the future.
+        futures.get(userp).complete(response);
+      }
     }
-
-    static class InferResponseComplete extends TRITONSERVER_InferenceResponseCompleteFn_t {
-        @Override public void call (
-            TRITONSERVER_InferenceResponse response, int flags, Pointer userp)
-        {
-          if (response != null) {
-            // Send 'response' to the future.
-            futures.get(userp).complete(response);
-          }
-        }
+  }
+
+  static ConcurrentHashMap<
+      Pointer, CompletableFuture<TRITONSERVER_InferenceResponse>> futures =
+      new ConcurrentHashMap<>();
+  static ResponseAlloc responseAlloc = new ResponseAlloc();
+  static ResponseRelease responseRelease = new ResponseRelease();
+  static InferRequestComplete inferRequestComplete = new InferRequestComplete();
+  static InferResponseComplete inferResponseComplete =
+      new InferResponseComplete();
+
+  static void GenerateInputData(FloatPointer[] input_data)
+  {
+    // Input size is 3 * 224 * 224
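+    // Every element is set to 1 so results can be compared against the
+    // precomputed expected outputs.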
+    input_data[0] = new FloatPointer(150528);
+    for (int i = 0; i < 150528; ++i) {
+      input_data[0].put(i, 1);
     }
-
-    static ConcurrentHashMap<Pointer, CompletableFuture<TRITONSERVER_InferenceResponse>> futures = new ConcurrentHashMap<>();
-    static ResponseAlloc responseAlloc = new ResponseAlloc();
-    static ResponseRelease responseRelease = new ResponseRelease();
-    static InferRequestComplete inferRequestComplete = new InferRequestComplete();
-    static InferResponseComplete inferResponseComplete = new InferResponseComplete();
-
-    static void
-    GenerateInputData(
-        FloatPointer[] input_data)
-    {
-      // Input size is 3 * 224 * 224
-      input_data[0] = new FloatPointer(150528);
-      for (int i = 0; i < 150528; ++i) {
-        input_data[0].put(i, 1);
+  }
+
+  static boolean AreValidResults(
+      String model_name, FloatPointer output, FloatPointer expected_output)
+  {
+    int output_length = model_name.contains("tensorflow") ? 1001 : 1000;
+    for (int i = 0; i < output_length; ++i) {
+      float difference = output.get(i) - expected_output.get(i);
+      if (difference > ALLOWED_DELTA) {
+        System.out.println(
+            model_name + "inference failure: unexpected output "
+            + "in " + model_name + ", index " + i);
+
+        System.out.println(
+            "Value: " + output.get(i) + ", expected " + expected_output.get(i));
+
+        return false; // Failure
       }
     }
+    return true; // Success
+  }
+
+  static void Check(
+      String model_name, Backend backend,
+      TRITONSERVER_InferenceResponse response, Pointer input_data,
+      String output, int expected_datatype) throws Exception
+  {
+    HashMap<String, BytePointer> output_data = new HashMap<>();
+
+    int[] output_count = {0};
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceResponseOutputCount(response, output_count),
+        "getting number of response outputs");
+    if (output_count[0] != 1) {
+      FAIL("expecting 1 response output, got " + output_count[0]);
+    }
 
-    static boolean
-    AreValidResults(
-        String model_name, FloatPointer output, FloatPointer expected_output)
-    {
-      int output_length = model_name.contains("tensorflow") ? 1001 : 1000;
-      for (int i = 0; i < output_length; ++i) {
-        float difference = output.get(i) - expected_output.get(i);
-        if (difference > ALLOWED_DELTA) {
-          System.out.println(model_name + "inference failure: unexpected output " +
-          "in " + model_name + ", index " + i);
+    for (int idx = 0; idx < output_count[0]; ++idx) {
+      BytePointer cname = new BytePointer((Pointer) null);
+      IntPointer datatype = new IntPointer(1);
+      LongPointer shape = new LongPointer((Pointer) null);
+      LongPointer dim_count = new LongPointer(1);
+      Pointer base = new Pointer();
+      SizeTPointer byte_size = new SizeTPointer(1);
+      IntPointer memory_type = new IntPointer(1);
+      LongPointer memory_type_id = new LongPointer(1);
+      Pointer userp = new Pointer();
 
-          System.out.println("Value: " + output.get(i) + ", expected " +
-          expected_output.get(i));
+      FAIL_IF_ERR(
+          TRITONSERVER_InferenceResponseOutput(
+              response, idx, cname, datatype, shape, dim_count, base, byte_size,
+              memory_type, memory_type_id, userp),
+          "getting output info");
 
-          return false; // Failure
-        }
+      if (cname.isNull()) {
+        FAIL("unable to get output name");
       }
-      return true; // Success
-    }
 
-    static void
-    Check(
-        String model_name, Backend backend,
-        TRITONSERVER_InferenceResponse response,
-        Pointer input_data, String output,
-        int expected_datatype) throws Exception
-    {
-      HashMap<String, BytePointer> output_data = new HashMap<>();
-
-      int[] output_count = {0};
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceResponseOutputCount(response, output_count),
-          "getting number of response outputs");
-      if (output_count[0] != 1) {
-        FAIL("expecting 1 response output, got " + output_count[0]);
+      String name = cname.getString();
+      if (!name.equals(output)) {
+        FAIL("unexpected output '" + name + "'");
       }
 
-      for (int idx = 0; idx < output_count[0]; ++idx) {
-        BytePointer cname = new BytePointer((Pointer)null);
-        IntPointer datatype = new IntPointer(1);
-        LongPointer shape = new LongPointer((Pointer)null);
-        LongPointer dim_count = new LongPointer(1);
-        Pointer base = new Pointer();
-        SizeTPointer byte_size = new SizeTPointer(1);
-        IntPointer memory_type = new IntPointer(1);
-        LongPointer memory_type_id = new LongPointer(1);
-        Pointer userp = new Pointer();
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseOutput(
-                response, idx, cname, datatype, shape, dim_count, base,
-                byte_size, memory_type, memory_type_id, userp),
-            "getting output info");
-
-        if (cname.isNull()) {
-          FAIL("unable to get output name");
-        }
+      int output_length = backend == Backend.TF ? 1001 : 1000;
 
-        String name = cname.getString();
-        if (!name.equals(output)) {
-          FAIL("unexpected output '" + name + "'");
-        }
+      if ((dim_count.get() != 2) || (shape.get(0) != 1)
+          || shape.get(1) != output_length) {
+        FAIL("unexpected shape for '" + name + "'");
+      }
 
-        int output_length = backend == backend.TF ? 1001: 1000;
+      if (datatype.get() != expected_datatype) {
+        FAIL(
+            "unexpected datatype '"
+            + TRITONSERVER_DataTypeString(datatype.get()) + "' for '" + name
+            + "'");
+      }
 
-        if ((dim_count.get() != 2) || (shape.get(0) != 1)
-        || shape.get(1) != output_length) {
-          FAIL("unexpected shape for '" + name + "'");
-        }
+      if (enforce_memory_type && (memory_type.get() != requested_memory_type)) {
+        FAIL(
+            "unexpected memory type, expected to be allocated in "
+            + TRITONSERVER_MemoryTypeString(requested_memory_type) + ", got "
+            + TRITONSERVER_MemoryTypeString(memory_type.get()) + ", id "
+            + memory_type_id.get() + " for " + name);
+      }
 
-        if (datatype.get() != expected_datatype) {
-          FAIL(
-              "unexpected datatype '" +
-              TRITONSERVER_DataTypeString(datatype.get()) + "' for '" +
-              name + "'");
-        }
+      // We make a copy of the data here... which we could avoid for
+      // performance reasons but ok for this simple example.
+      BytePointer odata = new BytePointer(byte_size.get());
+      output_data.put(name, odata);
+      odata.put(base.limit(byte_size.get()));
+    }
 
-        if (enforce_memory_type && (memory_type.get() != requested_memory_type)) {
-          FAIL(
-              "unexpected memory type, expected to be allocated in " +
-              TRITONSERVER_MemoryTypeString(requested_memory_type) +
-              ", got " + TRITONSERVER_MemoryTypeString(memory_type.get()) +
-              ", id " + memory_type_id.get() + " for " + name);
-        }
+    // Expected output for model
+    String file_name = "expected_output_data/expected_output_";
+    switch (backend) {
+      case ONNX:
+        file_name += "onnx";
+        break;
+      case TF:
+        file_name += "tensorflow";
+        break;
+      case TORCH:
+        file_name += "pytorch";
+        break;
+      default:
+        FAIL("Unsupported model type");
+        break;
+    }
+    file_name += ".txt";
 
-        // We make a copy of the data here... which we could avoid for
-        // performance reasons but ok for this simple example.
-        BytePointer odata = new BytePointer(byte_size.get());
-        output_data.put(name, odata);
-        odata.put(base.limit(byte_size.get()));
-      }
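+    // The TensorFlow model reports 1001 classes while the ONNX and Torch
+    // models report 1000, so size the expected output accordingly.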
+    int output_length = backend == Backend.TF ? 1001 : 1000;
+    FloatPointer expected_output = new FloatPointer(output_length);
 
-      // Expected output for model
-      String file_name = "expected_output_data/expected_output_";
-      switch (backend) {
-        case ONNX:
-          file_name += "onnx";
-          break;
-        case TF:
-          file_name += "tensorflow";
-          break;
-        case TORCH:
-          file_name += "pytorch";
-          break;
-        default:
-          FAIL("Unsupported model type");
-          break;
-      }
-      file_name += ".txt";
-      
-      int output_length = backend == backend.TF ? 1001: 1000;
-      FloatPointer expected_output = new FloatPointer(output_length);
-
-      try (Scanner scanner = new Scanner(new File(file_name))) {
-        for (int i = 0; i < output_length; ++i) {
-          expected_output.put(i, scanner.nextFloat());
-        } 
+    try (Scanner scanner = new Scanner(new File(file_name))) {
+      for (int i = 0; i < output_length; ++i) {
+        expected_output.put(i, scanner.nextFloat());
       }
+    }
 
-      boolean correct_results = AreValidResults(
-          model_name, new FloatPointer(output_data.get(output)),
-          expected_output);
+    boolean correct_results = AreValidResults(
+        model_name, new FloatPointer(output_data.get(output)), expected_output);
 
-      if(correct_results){
-        System.out.println(backend.name() + " test PASSED");
-      } else {
-        System.out.println(backend.name() + " test FAILED");
-      }
+    if (correct_results) {
+      System.out.println(backend.name() + " test PASSED");
+    } else {
+      System.out.println(backend.name() + " test FAILED");
     }
+  }
 
-    static void
-    PerformInference(
+  static void PerformInference(
       TRITONSERVER_ServerDeleter server, String model_name) throws Exception
-    {
-      // Get type of model
-      Backend backend = Backend.NONE;
-      if(model_name.contains("onnx")) {
-        backend = Backend.ONNX;
-      } else if (model_name.contains("savedmodel")) {
-        backend = Backend.TF;
-      } else if (model_name.contains("torch")) {
-        backend = Backend.TORCH;
-      } else {
-        FAIL("Supported model types (Onnx, TensorFlow, Torch) " +
-        "cannot be inferred from model name " + model_name);
-      }
+  {
+    // Get type of model
+    Backend backend = Backend.NONE;
+    if (model_name.contains("onnx")) {
+      backend = Backend.ONNX;
+    } else if (model_name.contains("savedmodel")) {
+      backend = Backend.TF;
+    } else if (model_name.contains("torch")) {
+      backend = Backend.TORCH;
+    } else {
+      FAIL(
+          "Supported model types (Onnx, TensorFlow, Torch) "
+          + "cannot be inferred from model name " + model_name);
+    }
 
-      // Wait for the model to become available.
-      boolean[] is_ready = {false};
-      int health_iters = 0;
-      while (!is_ready[0]) {
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerModelIsReady(
-                server, model_name, 1, is_ready),
-            "unable to get model readiness");
-        if (!is_ready[0]) {
-          if (++health_iters >= 10) {
-            FAIL(model_name + " model failed to be ready in 10 iterations");
-          }
-          Thread.sleep(500);
-          continue;
+    // Wait for the model to become available.
+    boolean[] is_ready = {false};
+    int health_iters = 0;
+    while (!is_ready[0]) {
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerModelIsReady(server, model_name, 1, is_ready),
+          "unable to get model readiness");
+      if (!is_ready[0]) {
+        if (++health_iters >= 10) {
+          FAIL(model_name + " model failed to be ready in 10 iterations");
         }
+        Thread.sleep(500);
+        continue;
       }
+    }
+
+    // Create the allocator that will be used to allocate buffers for
+    // the result tensors.
+    TRITONSERVER_ResponseAllocator allocator =
+        new TRITONSERVER_ResponseAllocator(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ResponseAllocatorNew(
+            allocator, responseAlloc, responseRelease, null /* start_fn */),
+        "creating response allocator");
+
+    // Inference
+    TRITONSERVER_InferenceRequest irequest =
+        new TRITONSERVER_InferenceRequest(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestNew(
+            irequest, server, model_name, -1 /* model_version */),
+        "creating inference request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
+        "setting ID for the request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetReleaseCallback(
+            irequest, inferRequestComplete, null /* request_release_userp */),
+        "setting request release callback");
+
+
+    // Model inputs
+    String input = "";
+    String output = "";
+    long[] input_shape = {1, 224, 224, 3};
+
+    switch (backend) {
+      case ONNX:
+        input = "import/input:0";
+        output = "import/resnet_v1_50/predictions/Softmax:0";
+        break;
+      case TF:
+        input = "input";
+        output = "probabilities";
+        break;
+      case TORCH:
+        input = "INPUT__0";
+        input_shape[1] = 3;
+        input_shape[3] = 224;
+        output = "OUTPUT__0";
+        break;
+      default:
+        FAIL("Unsupported model type");
+        break;
+    }
+
+    int datatype = TRITONSERVER_TYPE_FP32;
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddInput(
+            irequest, input, datatype, input_shape, input_shape.length),
+        "setting input 0 meta-data for the request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output),
+        "requesting output 0 for the request");
+
+    // Create the data for the input tensor.
+    BytePointer input_data;
+    FloatPointer[] p0 = {null};
+    GenerateInputData(p0);
+    input_data = p0[0].getPointer(BytePointer.class);
+    long input_size = input_data.limit();
+    Pointer input_base = input_data;
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAppendInputData(
+            irequest, input, input_base, input_size, requested_memory_type,
+            0 /* memory_type_id */),
+        "assigning INPUT data");
+
+    // Perform inference...
+    {
+      CompletableFuture<TRITONSERVER_InferenceResponse> completed =
+          new CompletableFuture<>();
+      futures.put(irequest, completed);
 
-      // Create the allocator that will be used to allocate buffers for
-      // the result tensors.
-      TRITONSERVER_ResponseAllocator allocator = new TRITONSERVER_ResponseAllocator(null);
       FAIL_IF_ERR(
-          TRITONSERVER_ResponseAllocatorNew(
-              allocator, responseAlloc, responseRelease, null /* start_fn */),
-          "creating response allocator");
+          TRITONSERVER_InferenceRequestSetResponseCallback(
+              irequest, allocator, null /* response_allocator_userp */,
+              inferResponseComplete, irequest),
+          "setting response callback");
 
-      // Inference
-      TRITONSERVER_InferenceRequest irequest = new TRITONSERVER_InferenceRequest(null);
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestNew(
-              irequest, server, model_name, -1 /* model_version */),
-          "creating inference request");
+          TRITONSERVER_ServerInferAsync(server, irequest, null /* trace */),
+          "running inference");
+
+      // Wait for the inference to complete.
+      TRITONSERVER_InferenceResponse completed_response = completed.get();
+      futures.remove(irequest);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
-          "setting ID for the request");
+          TRITONSERVER_InferenceResponseError(completed_response),
+          "response status");
+
+      Check(
+          model_name, backend, completed_response, input_data, output,
+          datatype);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetReleaseCallback(
-              irequest, inferRequestComplete, null /* request_release_userp */),
-          "setting request release callback");
-
-      
-      // Model inputs
-      String input = "";
-      String output = "";
-      long[] input_shape = {1, 224, 224, 3};
-
-      switch (backend) {
-        case ONNX:
-          input = "import/input:0";
-          output = "import/resnet_v1_50/predictions/Softmax:0";
+          TRITONSERVER_InferenceResponseDelete(completed_response),
+          "deleting inference response");
+    }
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestDelete(irequest),
+        "deleting inference request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_ResponseAllocatorDelete(allocator),
+        "deleting response allocator");
+  }
+
+  public static void main(String[] args) throws Exception
+  {
+    String model_repository_path = null;
+    int verbose_level = 0;
+
+    // Parse commandline...
+    for (int i = 0; i < args.length; i++) {
+      switch (args[i]) {
+        case "-m": {
+          enforce_memory_type = true;
+          i++;
+          if (args[i].equals("system")) {
+            requested_memory_type = TRITONSERVER_MEMORY_CPU;
+          } else if (args[i].equals("pinned")) {
+            requested_memory_type = TRITONSERVER_MEMORY_CPU_PINNED;
+          } else if (args[i].equals("gpu")) {
+            requested_memory_type = TRITONSERVER_MEMORY_GPU;
+          } else {
+            Usage(
+                "-m must be used to specify one of the following types:"
+                + " <\"system\"|\"pinned\"|gpu>");
+          }
           break;
-        case TF:
-          input = "input";
-          output = "probabilities";
+        }
+        case "-r":
+          model_repository_path = args[++i];
           break;
-        case TORCH:
-          input = "INPUT__0";
-          input_shape[1] = 3;
-          input_shape[3] = 224;
-          output = "OUTPUT__0";
+        case "-v":
+          verbose_level = 1;
           break;
-        default:
-          FAIL("Unsupported model type");
+        case "-?":
+          Usage(null);
           break;
       }
+    }
 
-      int datatype = TRITONSERVER_TYPE_FP32;
-
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddInput(
-              irequest, input, datatype, input_shape, input_shape.length),
-          "setting input 0 meta-data for the request");
-
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output),
-          "requesting output 0 for the request");
-
-      // Create the data for the two input tensors. Initialize the first
-      // to unique values and the second to all ones.
-      BytePointer input_data;
-      FloatPointer[] p0 = {null};
-      GenerateInputData(p0);
-      input_data = p0[0].getPointer(BytePointer.class);
-      long input_size = input_data.limit();
-      Pointer input_base = input_data;
+    if (model_repository_path == null) {
+      Usage("-r must be used to specify model repository path");
+    }
+    if (enforce_memory_type
+        && requested_memory_type != TRITONSERVER_MEMORY_CPU) {
+      Usage("-m can only be set to \"system\" without enabling GPU");
+    }
 
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAppendInputData(
-              irequest, input, input_base, input_size, requested_memory_type,
-              0 /* memory_type_id */),
-          "assigning INPUT data");
-
-      // Perform inference...
-      {
-        CompletableFuture<TRITONSERVER_InferenceResponse> completed = new CompletableFuture<>();
-        futures.put(irequest, completed);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestSetResponseCallback(
-                irequest, allocator, null /* response_allocator_userp */,
-                inferResponseComplete, irequest),
-            "setting response callback");
-
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerInferAsync(
-                server, irequest, null /* trace */),
-            "running inference");
-
-        // Wait for the inference to complete.
-        TRITONSERVER_InferenceResponse completed_response = completed.get();
-        futures.remove(irequest);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseError(completed_response),
-            "response status");
-
-        Check(
-            model_name, backend, completed_response, input_data, output, datatype);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseDelete(completed_response),
-            "deleting inference response");
-      }
+    // Check API version.
+    int[] api_version_major = {0}, api_version_minor = {0};
+    FAIL_IF_ERR(
+        TRITONSERVER_ApiVersion(api_version_major, api_version_minor),
+        "getting Triton API version");
+    if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major[0])
+        || (TRITONSERVER_API_VERSION_MINOR > api_version_minor[0])) {
+      FAIL("triton server API version mismatch");
+    }
 
+    // Create the server...
+    TRITONSERVER_ServerOptions server_options =
+        new TRITONSERVER_ServerOptions(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsNew(server_options),
+        "creating server options");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetModelRepositoryPath(
+            server_options, model_repository_path),
+        "setting model repository path");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
+        "setting verbose logging level");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetBackendDirectory(
+            server_options, "/opt/tritonserver/backends"),
+        "setting backend directory");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
+            server_options, "/opt/tritonserver/repoagents"),
+        "setting repository agent directory");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
+        "setting strict model configuration");
+    double min_compute_capability = TRITON_MIN_COMPUTE_CAPABILITY;
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetMinSupportedComputeCapability(
+            server_options, min_compute_capability),
+        "setting minimum supported CUDA compute capability");
+
+    TRITONSERVER_Server server_ptr = new TRITONSERVER_Server(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerNew(server_ptr, server_options), "creating server");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsDelete(server_options),
+        "deleting server options");
+
+    TRITONSERVER_ServerDeleter server =
+        new TRITONSERVER_ServerDeleter(server_ptr);
+
+    // Wait until the server is both live and ready.
+    int health_iters = 0;
+    while (true) {
+      boolean[] live = {false}, ready = {false};
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestDelete(irequest),
-          "deleting inference request");
-
+          TRITONSERVER_ServerIsLive(server, live),
+          "unable to get server liveness");
       FAIL_IF_ERR(
-          TRITONSERVER_ResponseAllocatorDelete(allocator),
-          "deleting response allocator");
-    }
-    
-    public static void
-    main(String[] args) throws Exception
-    {
-      String model_repository_path = null;
-      int verbose_level = 0;
-
-      // Parse commandline...
-      for (int i = 0; i < args.length; i++) {
-        switch (args[i]) {
-          case "-m": {
-            enforce_memory_type = true;
-            i++;
-            if (args[i].equals("system")) {
-              requested_memory_type = TRITONSERVER_MEMORY_CPU;
-            } else if (args[i].equals("pinned")) {
-              requested_memory_type = TRITONSERVER_MEMORY_CPU_PINNED;
-            } else if (args[i].equals("gpu")) {
-              requested_memory_type = TRITONSERVER_MEMORY_GPU;
-            } else {
-              Usage(
-                  "-m must be used to specify one of the following types:" +
-                  " <\"system\"|\"pinned\"|gpu>");
-            }
-            break;
-          }
-          case "-r":
-            model_repository_path = args[++i];
-            break;
-          case "-v":
-            verbose_level = 1;
-            break;
-          case "-?":
-            Usage(null);
-            break;
-        }
-      }
-
-      if (model_repository_path == null) {
-        Usage("-r must be used to specify model repository path");
-      }
-      if (enforce_memory_type && requested_memory_type != TRITONSERVER_MEMORY_CPU) {
-        Usage("-m can only be set to \"system\" without enabling GPU");
+          TRITONSERVER_ServerIsReady(server, ready),
+          "unable to get server readiness");
+      System.out.println(
+          "Server Health: live " + live[0] + ", ready " + ready[0]);
+      if (live[0] && ready[0]) {
+        break;
       }
 
-      // Check API version.
-      int[] api_version_major = {0}, api_version_minor = {0};
-      FAIL_IF_ERR(
-          TRITONSERVER_ApiVersion(api_version_major, api_version_minor),
-          "getting Triton API version");
-      if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major[0]) ||
-          (TRITONSERVER_API_VERSION_MINOR > api_version_minor[0])) {
-        FAIL("triton server API version mismatch");
+      if (++health_iters >= 10) {
+        FAIL("failed to find healthy inference server");
       }
 
-      // Create the server...
-      TRITONSERVER_ServerOptions server_options = new TRITONSERVER_ServerOptions(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsNew(server_options),
-          "creating server options");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetModelRepositoryPath(
-              server_options, model_repository_path),
-          "setting model repository path");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
-          "setting verbose logging level");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetBackendDirectory(
-              server_options, "/opt/tritonserver/backends"),
-          "setting backend directory");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
-              server_options, "/opt/tritonserver/repoagents"),
-          "setting repository agent directory");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
-          "setting strict model configuration");
-      double min_compute_capability = TRITON_MIN_COMPUTE_CAPABILITY;
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetMinSupportedComputeCapability(
-              server_options, min_compute_capability),
-          "setting minimum supported CUDA compute capability");
+      Thread.sleep(500);
+    }
 
-      TRITONSERVER_Server server_ptr = new TRITONSERVER_Server(null);
+    // Print status of the server.
+    {
+      TRITONSERVER_Message server_metadata_message =
+          new TRITONSERVER_Message(null);
       FAIL_IF_ERR(
-          TRITONSERVER_ServerNew(server_ptr, server_options), "creating server");
+          TRITONSERVER_ServerMetadata(server, server_metadata_message),
+          "unable to get server metadata message");
+      BytePointer buffer = new BytePointer((Pointer) null);
+      SizeTPointer byte_size = new SizeTPointer(1);
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsDelete(server_options),
-          "deleting server options");
-
-      TRITONSERVER_ServerDeleter server = new TRITONSERVER_ServerDeleter(server_ptr);
-
-      // Wait until the server is both live and ready.
-      int health_iters = 0;
-      while (true) {
-        boolean[] live = {false}, ready = {false};
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerIsLive(server, live),
-            "unable to get server liveness");
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerIsReady(server, ready),
-            "unable to get server readiness");
-        System.out.println("Server Health: live " + live[0] + ", ready " + ready[0]);
-        if (live[0] && ready[0]) {
-          break;
-        }
-
-        if (++health_iters >= 10) {
-          FAIL("failed to find healthy inference server");
-        }
+          TRITONSERVER_MessageSerializeToJson(
+              server_metadata_message, buffer, byte_size),
+          "unable to serialize server metadata message");
 
-        Thread.sleep(500);
-      }
-
-      // Print status of the server.
-      {
-        TRITONSERVER_Message server_metadata_message = new TRITONSERVER_Message(null);
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerMetadata(server, server_metadata_message),
-            "unable to get server metadata message");
-        BytePointer buffer = new BytePointer((Pointer)null);
-        SizeTPointer byte_size = new SizeTPointer(1);
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageSerializeToJson(
-                server_metadata_message, buffer, byte_size),
-            "unable to serialize server metadata message");
-
-        System.out.println("Server Status:");
-        System.out.println(buffer.limit(byte_size.get()).getString());
-
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageDelete(server_metadata_message),
-            "deleting status metadata");
-      }
+      System.out.println("Server Status:");
+      System.out.println(buffer.limit(byte_size.get()).getString());
 
-      for(String model : MODELS) {
-        PerformInference(server, model);
-      }
+      FAIL_IF_ERR(
+          TRITONSERVER_MessageDelete(server_metadata_message),
+          "deleting status metadata");
+    }
 
-      System.exit(0);
+    for (String model : MODELS) {
+      PerformInference(server, model);
     }
+
+    System.exit(0);
+  }
 }
diff --git a/qa/L0_java_resnet/test.sh b/qa/L0_java_resnet/test.sh
index e2f424fd7e..1ca08b4c65 100755
--- a/qa/L0_java_resnet/test.sh
+++ b/qa/L0_java_resnet/test.sh
@@ -41,6 +41,8 @@ fi
 # Models
 DATADIR=/data/inferenceserver/${REPO_VERSION}
 MODEL_REPO=`pwd`/models
+JAVACPP_BRANCH=${JAVACPP_BRANCH:="https://github.com/bytedeco/javacpp-presets.git"}
+JAVACPP_BRANCH_TAG=${JAVACPP_BRANCH_TAG:="master"}
 
 # Create local model repository
 mkdir -p ${MODEL_REPO}
@@ -53,14 +55,10 @@ done
 
 # Set up test files based on installation instructions
 # https://github.com/bytedeco/javacpp-presets/blob/master/tritonserver/README.md
-set +e
-rm -r javacpp-presets
-git clone https://github.com/bytedeco/javacpp-presets.git
-cd javacpp-presets
-mvn clean install --projects .,tritonserver
-mvn clean install -f platform --projects ../tritonserver/platform -Djavacpp.platform.host
-cd ..
 set -e
+git clone --single-branch --depth=1 -b ${TRITON_CLIENT_REPO_TAG} https://github.com/triton-inference-server/client.git
+source client/src/java-api-bindings/scripts/install_dependencies_and_build.sh -b $PWD --javacpp-branch ${JAVACPP_BRANCH} --javacpp-tag ${JAVACPP_BRANCH_TAG} --keep-build-dependencies
+cd ..
 
 CLIENT_LOG="client.log"
 SAMPLES_REPO=`pwd`/javacpp-presets/tritonserver/samples/simple
diff --git a/qa/L0_java_sequence_batcher/SequenceTest.java b/qa/L0_java_sequence_batcher/SequenceTest.java
index 3fdc5d63c1..cfce3584de 100644
--- a/qa/L0_java_sequence_batcher/SequenceTest.java
+++ b/qa/L0_java_sequence_batcher/SequenceTest.java
@@ -1,4 +1,4 @@
-// Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 //
 // Redistribution and use in source and binary forms, with or without
 // modification, are permitted provided that the following conditions
@@ -24,615 +24,642 @@
 // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
+import static org.bytedeco.tritonserver.global.tritonserver.*;
+
+import com.google.gson.*;
 import java.io.*;
 import java.util.*;
 import java.util.concurrent.*;
-import com.google.gson.*;
 import org.bytedeco.javacpp.*;
 import org.bytedeco.tritonserver.tritonserver.*;
-import static org.bytedeco.tritonserver.global.tritonserver.*;
 
 public class SequenceTest {
-
-    // Boilerplate code for setting up Triton
-    static void FAIL(String MSG) {
-        System.err.println("Failure: " + MSG);
-        System.exit(1);
-    }
-
-    static void FAIL_IF_ERR(TRITONSERVER_Error err__, String MSG) {
-        if (err__ != null) {
-            System.err.println("error: " + MSG + ":"
-                             + TRITONSERVER_ErrorCodeString(err__) + " - "
-                             + TRITONSERVER_ErrorMessage(err__));
-            TRITONSERVER_ErrorDelete(err__);
-            System.exit(1);
-        }
+  // Boilerplate code for setting up Triton
+  static void FAIL(String MSG)
+  {
+    System.err.println("Failure: " + MSG);
+    System.exit(1);
+  }
+
+  static void FAIL_IF_ERR(TRITONSERVER_Error err__, String MSG)
+  {
+    if (err__ != null) {
+      System.err.println(
+          "error: " + MSG + ":" + TRITONSERVER_ErrorCodeString(err__) + " - "
+          + TRITONSERVER_ErrorMessage(err__));
+      TRITONSERVER_ErrorDelete(err__);
+      System.exit(1);
     }
+  }
 
-    static int requested_memory_type = TRITONSERVER_MEMORY_CPU;
-
-    static class TRITONSERVER_ServerDeleter extends TRITONSERVER_Server {
-        public TRITONSERVER_ServerDeleter(TRITONSERVER_Server p) { super(p); deallocator(new DeleteDeallocator(this)); }
-        protected static class DeleteDeallocator extends TRITONSERVER_Server implements Deallocator {
-            DeleteDeallocator(Pointer p) { super(p); }
-            @Override public void deallocate() { TRITONSERVER_ServerDelete(this); }
-        }
-    }
+  static int requested_memory_type = TRITONSERVER_MEMORY_CPU;
 
-    static void
-    Usage(String msg)
+  static class TRITONSERVER_ServerDeleter extends TRITONSERVER_Server {
+    public TRITONSERVER_ServerDeleter(TRITONSERVER_Server p)
     {
-      if (msg != null) {
-        System.err.println(msg);
-      }
-
-      System.err.println("Usage: java " + SequenceTest.class.getSimpleName() + " [options]");
-      System.err.println("\t-m [model name]");
-      System.err.println("\t-v Enable verbose logging");
-      System.err.println("\t-r [model repository absolute path]");
-
-      System.exit(1);
+      super(p);
+      deallocator(new DeleteDeallocator(this));
     }
-
-    static class ResponseAlloc extends TRITONSERVER_ResponseAllocatorAllocFn_t {
-        @Override public TRITONSERVER_Error call (
-            TRITONSERVER_ResponseAllocator allocator, String tensor_name,
-            long byte_size, int preferred_memory_type,
-            long preferred_memory_type_id, Pointer userp, PointerPointer buffer,
-            PointerPointer buffer_userp, IntPointer actual_memory_type,
-            LongPointer actual_memory_type_id)
-        {
-          // Initially attempt to make the actual memory type and id that we
-          // allocate be the same as preferred memory type
-          actual_memory_type.put(0, preferred_memory_type);
-          actual_memory_type_id.put(0, preferred_memory_type_id);
-
-          // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
-          // need to do any other book-keeping.
-          if (byte_size == 0) {
-            buffer.put(0, null);
-            buffer_userp.put(0, null);
-            System.out.println("allocated " + byte_size + " bytes for result tensor " + tensor_name);
-          } else {
-            Pointer allocated_ptr = new Pointer();
-            actual_memory_type.put(0, requested_memory_type);
-
-            actual_memory_type.put(0, TRITONSERVER_MEMORY_CPU);
-            allocated_ptr = Pointer.malloc(byte_size);
-
-            // Pass the tensor name with buffer_userp so we can show it when
-            // releasing the buffer.
-            if (!allocated_ptr.isNull()) {
-              buffer.put(0, allocated_ptr);
-              buffer_userp.put(0, new BytePointer(tensor_name));
-              System.out.println("allocated " + byte_size + " bytes in "
-                               + TRITONSERVER_MemoryTypeString(actual_memory_type.get())
-                               + " for result tensor " + tensor_name);
-            }
-          }
-
-          return null;  // Success
-        }
+    protected static class DeleteDeallocator
+        extends TRITONSERVER_Server implements Deallocator {
+      DeleteDeallocator(Pointer p) { super(p); }
+      @Override public void deallocate() { TRITONSERVER_ServerDelete(this); }
     }
+  }
 
-    static class ResponseRelease extends TRITONSERVER_ResponseAllocatorReleaseFn_t {
-        @Override public TRITONSERVER_Error call (
-            TRITONSERVER_ResponseAllocator allocator, Pointer buffer, Pointer buffer_userp,
-            long byte_size, int memory_type, long memory_type_id)
-        {
-          BytePointer name = null;
-          if (buffer_userp != null) {
-            name = new BytePointer(buffer_userp);
-          } else {
-            name = new BytePointer("");
-          }
-
-          System.out.println("Releasing buffer " + buffer + " of size " + byte_size
-                           + " in " + TRITONSERVER_MemoryTypeString(memory_type)
-                           + " for result '" + name.getString() + "'");
-          Pointer.free(buffer);
-          name.deallocate();
-
-          return null;  // Success
-        }
+  static void Usage(String msg)
+  {
+    if (msg != null) {
+      System.err.println(msg);
     }
 
-    static class InferRequestComplete extends TRITONSERVER_InferenceRequestReleaseFn_t {
-        @Override public void call (
-            TRITONSERVER_InferenceRequest request, int flags, Pointer userp)
-        {
-          // We reuse the request so we don't delete it here.
+    System.err.println(
+        "Usage: java " + SequenceTest.class.getSimpleName() + " [options]");
+    System.err.println("\t-m [model name]");
+    System.err.println("\t-v Enable verbose logging");
+    System.err.println("\t-r [model repository absolute path]");
+
+    System.exit(1);
+  }
+
+  static class ResponseAlloc extends TRITONSERVER_ResponseAllocatorAllocFn_t {
+    @Override
+    public TRITONSERVER_Error call(
+        TRITONSERVER_ResponseAllocator allocator, String tensor_name,
+        long byte_size, int preferred_memory_type,
+        long preferred_memory_type_id, Pointer userp, PointerPointer buffer,
+        PointerPointer buffer_userp, IntPointer actual_memory_type,
+        LongPointer actual_memory_type_id)
+    {
+      // Initially attempt to make the actual memory type and id that we
+      // allocate be the same as preferred memory type
+      actual_memory_type.put(0, preferred_memory_type);
+      actual_memory_type_id.put(0, preferred_memory_type_id);
+
+      // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
+      // need to do any other book-keeping.
+      if (byte_size == 0) {
+        buffer.put(0, null);
+        buffer_userp.put(0, null);
+        System.out.println(
+            "allocated " + byte_size + " bytes for result tensor "
+            + tensor_name);
+      } else {
+        Pointer allocated_ptr = new Pointer();
+        actual_memory_type.put(0, requested_memory_type);
+
+        actual_memory_type.put(0, TRITONSERVER_MEMORY_CPU);
+        allocated_ptr = Pointer.malloc(byte_size);
+
+        // Pass the tensor name with buffer_userp so we can show it when
+        // releasing the buffer.
+        if (!allocated_ptr.isNull()) {
+          buffer.put(0, allocated_ptr);
+          buffer_userp.put(0, new BytePointer(tensor_name));
+          System.out.println(
+              "allocated " + byte_size + " bytes in "
+              + TRITONSERVER_MemoryTypeString(actual_memory_type.get())
+              + " for result tensor " + tensor_name);
         }
-    }
+      }
 
-    static class InferResponseComplete extends TRITONSERVER_InferenceResponseCompleteFn_t {
-        @Override public void call (
-            TRITONSERVER_InferenceResponse response, int flags, Pointer userp)
-        {
-          if (response != null) {
-            // Send 'response' to the future.
-            futures.get(userp).complete(response);
-          }
-        }
+      return null; // Success
     }
-
-    static ConcurrentHashMap<Pointer, CompletableFuture<TRITONSERVER_InferenceResponse>> futures = new ConcurrentHashMap<>();
-    static ResponseAlloc responseAlloc = new ResponseAlloc();
-    static ResponseRelease responseRelease = new ResponseRelease();
-    static InferRequestComplete inferRequestComplete = new InferRequestComplete();
-    static InferResponseComplete inferResponseComplete = new InferResponseComplete();
-
-    static TRITONSERVER_Error
-    ParseModelMetadata(
-        JsonObject model_metadata,
-        boolean[] is_torch_model)
+  }
+
+  static class ResponseRelease
+      extends TRITONSERVER_ResponseAllocatorReleaseFn_t {
+    @Override
+    public TRITONSERVER_Error call(
+        TRITONSERVER_ResponseAllocator allocator, Pointer buffer,
+        Pointer buffer_userp, long byte_size, int memory_type,
+        long memory_type_id)
     {
-      String seen_data_type = null;
-      for (JsonElement input_element : model_metadata.get("inputs").getAsJsonArray()) {
-        JsonObject input = input_element.getAsJsonObject();
-        if (!input.get("datatype").getAsString().equals("INT32")) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_UNSUPPORTED,
-              "sequence qa example only supports model with data type INT32");
-        }
-        if (seen_data_type == null) {
-          seen_data_type = input.get("datatype").getAsString();
-        } else if (!seen_data_type.equals(input.get("datatype").getAsString())) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_INVALID_ARG,
-              "the inputs and outputs of sequence model must have the data type");
-        }
-      }
-      for (JsonElement output_element : model_metadata.get("outputs").getAsJsonArray()) {
-        JsonObject output = output_element.getAsJsonObject();
-        if (!output.get("datatype").getAsString().equals("INT32")) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_UNSUPPORTED,
-              "sequence qa example only supports model with data type INT32");
-        } else if (!seen_data_type.equals(output.get("datatype").getAsString())) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_INVALID_ARG,
-              "the inputs and outputs of sequence' model must have the data type");
-        }
+      BytePointer name = null;
+      if (buffer_userp != null) {
+        name = new BytePointer(buffer_userp);
+      } else {
+        name = new BytePointer("");
       }
 
-      is_torch_model[0] =
-          model_metadata.get("platform").getAsString().equals("pytorch_libtorch");
-      return null;
+      System.out.println(
+          "Releasing buffer " + buffer + " of size " + byte_size + " in "
+          + TRITONSERVER_MemoryTypeString(memory_type) + " for result '"
+          + name.getString() + "'");
+      Pointer.free(buffer);
+      name.deallocate();
+
+      return null; // Success
     }
+  }
 
-    // Custom function to set metadata required for sequence batcher
-    static void
-    SetSequenceMetadata(TRITONSERVER_InferenceRequest irequest, long correlation_id, boolean sequence_start, boolean sequence_end)
+  static class InferRequestComplete
+      extends TRITONSERVER_InferenceRequestReleaseFn_t {
+    @Override
+    public void call(
+        TRITONSERVER_InferenceRequest request, int flags, Pointer userp)
     {
+      // We reuse the request so we don't delete it here.
+    }
+  }
 
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetCorrelationId(
-              irequest, correlation_id), "Unable to set correlation ID");
-      int flags = 0;
-      if(sequence_start) {
-        flags += TRITONSERVER_REQUEST_FLAG_SEQUENCE_START;
+  static class InferResponseComplete
+      extends TRITONSERVER_InferenceResponseCompleteFn_t {
+    @Override
+    public void call(
+        TRITONSERVER_InferenceResponse response, int flags, Pointer userp)
+    {
+      if (response != null) {
+        // Send 'response' to the future.
+        futures.get(userp).complete(response);
       }
-      if(sequence_end) {
-        flags += TRITONSERVER_REQUEST_FLAG_SEQUENCE_END;
+    }
+  }
+
+  static ConcurrentHashMap<
+      Pointer, CompletableFuture<TRITONSERVER_InferenceResponse>> futures =
+      new ConcurrentHashMap<>();
+  static ResponseAlloc responseAlloc = new ResponseAlloc();
+  static ResponseRelease responseRelease = new ResponseRelease();
+  static InferRequestComplete inferRequestComplete = new InferRequestComplete();
+  static InferResponseComplete inferResponseComplete =
+      new InferResponseComplete();
+
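+  // Verify that the model's inputs and outputs all use the INT32 data type
+  // and record whether the model is a PyTorch (libtorch) model.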
+  static TRITONSERVER_Error ParseModelMetadata(
+      JsonObject model_metadata, boolean[] is_torch_model)
+  {
+    String seen_data_type = null;
+    for (JsonElement input_element :
+         model_metadata.get("inputs").getAsJsonArray()) {
+      JsonObject input = input_element.getAsJsonObject();
+      if (!input.get("datatype").getAsString().equals("INT32")) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_UNSUPPORTED,
+            "sequence qa example only supports model with data type INT32");
+      }
+      if (seen_data_type == null) {
+        seen_data_type = input.get("datatype").getAsString();
+      } else if (!seen_data_type.equals(input.get("datatype").getAsString())) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_INVALID_ARG,
+            "the inputs and outputs of sequence model must have the data type");
+      }
+    }
+    for (JsonElement output_element :
+         model_metadata.get("outputs").getAsJsonArray()) {
+      JsonObject output = output_element.getAsJsonObject();
+      if (!output.get("datatype").getAsString().equals("INT32")) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_UNSUPPORTED,
+            "sequence qa example only supports model with data type INT32");
+      } else if (!seen_data_type.equals(output.get("datatype").getAsString())) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_INVALID_ARG,
+            "the inputs and outputs of sequence' model must have the data type");
       }
-      FAIL_IF_ERR(
-        TRITONSERVER_InferenceRequestSetFlags(
-            irequest, flags), "Unable to set flags");
-
     }
 
-    // Custom function for adjusting sequence batcher
-    // expected results for backends that do not implement
-    // full accumulator
-    static int
-    GetExpectedResult(String model_name, int expected_result, int value, String flag){
-      if((!model_name.contains("nobatch") && !model_name.contains("custom")) ||
-          model_name.contains("graphdef") || model_name.contains("plan") ||
-          model_name.contains("onnx") || model_name.contains("libtorch")){
-            expected_result = value;
-            if(flag != null && flag.contains("start")){
-              expected_result++;
-            }
-        }
-        return expected_result;
+    is_torch_model[0] =
+        model_metadata.get("platform").getAsString().equals("pytorch_libtorch");
+    return null;
+  }
+
+  // Custom function to set metadata required for sequence batcher
+  static void SetSequenceMetadata(
+      TRITONSERVER_InferenceRequest irequest, long correlation_id,
+      boolean sequence_start, boolean sequence_end)
+  {
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetCorrelationId(irequest, correlation_id),
+        "Unable to set correlation ID");
+    int flags = 0;
+    if (sequence_start) {
+      flags += TRITONSERVER_REQUEST_FLAG_SEQUENCE_START;
+    }
+    if (sequence_end) {
+      flags += TRITONSERVER_REQUEST_FLAG_SEQUENCE_END;
+    }
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetFlags(irequest, flags),
+        "Unable to set flags");
+  }
+
+  // Custom function for adjusting sequence batcher
+  // expected results for backends that do not implement
+  // full accumulator
+  static int GetExpectedResult(
+      String model_name, int expected_result, int value, String flag)
+  {
+    if ((!model_name.contains("nobatch") && !model_name.contains("custom"))
+        || model_name.contains("graphdef") || model_name.contains("plan")
+        || model_name.contains("onnx") || model_name.contains("libtorch")) {
+      expected_result = value;
+      if (flag != null && flag.contains("start")) {
+        expected_result++;
+      }
+    }
+    return expected_result;
+  }
+
+  // Standard function for checking response parameters,
+  // plus customized check that final sequence result
+  // "out" matches expected result
+  static void Check(
+      String model_name, TRITONSERVER_InferenceResponse response,
+      int input_value, String output0, long expected_byte_size,
+      int expected_datatype, boolean sequence_end, int expected_result)
+  {
+    HashMap<String, BytePointer> output_data = new HashMap<>();
+
+    int[] output_count = {0};
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceResponseOutputCount(response, output_count),
+        "getting number of response outputs");
+    if (output_count[0] != 1) {
+      FAIL("expecting 1 response outputs, got " + output_count[0]);
     }
 
-    // Standard function for checking response parameters,
-    // plus customized check that final sequence result
-    // "out" matches expected result
-    static void
-    Check(
-        String model_name,
-        TRITONSERVER_InferenceResponse response,
-        int input_value, String output0,
-        long expected_byte_size, int expected_datatype,
-        boolean sequence_end, int expected_result)
-    {
-      HashMap<String, BytePointer> output_data = new HashMap<>();
+    for (int idx = 0; idx < output_count[0]; ++idx) {
+      BytePointer cname = new BytePointer((Pointer) null);
+      IntPointer datatype = new IntPointer(1);
+      LongPointer shape = new LongPointer((Pointer) null);
+      LongPointer dim_count = new LongPointer(1);
+      Pointer base = new Pointer();
+      SizeTPointer byte_size = new SizeTPointer(1);
+      IntPointer memory_type = new IntPointer(1);
+      LongPointer memory_type_id = new LongPointer(1);
+      Pointer userp = new Pointer();
 
-      int[] output_count = {0};
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceResponseOutputCount(response, output_count),
-          "getting number of response outputs");
-      if (output_count[0] != 1) {
-        FAIL("expecting 1 response outputs, got " + output_count[0]);
+          TRITONSERVER_InferenceResponseOutput(
+              response, idx, cname, datatype, shape, dim_count, base, byte_size,
+              memory_type, memory_type_id, userp),
+          "getting output info");
+
+      if (cname.isNull()) {
+        FAIL("unable to get output name");
       }
 
-      for (int idx = 0; idx < output_count[0]; ++idx) {
-        BytePointer cname = new BytePointer((Pointer)null);
-        IntPointer datatype = new IntPointer(1);
-        LongPointer shape = new LongPointer((Pointer)null);
-        LongPointer dim_count = new LongPointer(1);
-        Pointer base = new Pointer();
-        SizeTPointer byte_size = new SizeTPointer(1);
-        IntPointer memory_type = new IntPointer(1);
-        LongPointer memory_type_id = new LongPointer(1);
-        Pointer userp = new Pointer();
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseOutput(
-                response, idx, cname, datatype, shape, dim_count, base,
-                byte_size, memory_type, memory_type_id, userp),
-            "getting output info");
-
-        if (cname.isNull()) {
-          FAIL("unable to get output name");
-        }
+      String name = cname.getString();
+      if (!name.equals(output0)) {
+        FAIL("unexpected output '" + name + "'");
+      }
 
-        String name = cname.getString();
-        if (!name.equals(output0)) {
-          FAIL("unexpected output '" + name + "'");
-        }
+      if ((dim_count.get() != 1) || (shape.get(0) != 1)) {
+        FAIL("unexpected shape for '" + name + "'");
+      }
 
-        if ((dim_count.get() != 1) || (shape.get(0) != 1)) {
-          FAIL("unexpected shape for '" + name + "'");
-        }
+      if (datatype.get() != expected_datatype) {
+        FAIL(
+            "unexpected datatype '"
+            + TRITONSERVER_DataTypeString(datatype.get()) + "' for '" + name
+            + "'");
+      }
 
-        if (datatype.get() != expected_datatype) {
-          FAIL(
-              "unexpected datatype '" +
-              TRITONSERVER_DataTypeString(datatype.get()) + "' for '" +
-              name + "'");
-        }
+      if (byte_size.get() != expected_byte_size) {
+        FAIL(
+            "unexpected byte-size, expected " + expected_byte_size + ", got "
+            + byte_size.get() + " for " + name);
+      }
 
-        if (byte_size.get() != expected_byte_size) {
-          FAIL(
-              "unexpected byte-size, expected " +
-              expected_byte_size + ", got " +
-              byte_size.get() + " for " + name);
-        }
+      if (memory_type.get() != requested_memory_type) {
+        FAIL(
+            "unexpected memory type, expected to be allocated in "
+            + TRITONSERVER_MemoryTypeString(requested_memory_type) + ", got "
+            + TRITONSERVER_MemoryTypeString(memory_type.get()) + ", id "
+            + memory_type_id.get() + " for " + name);
+      }
 
-        if (memory_type.get() != requested_memory_type) {
-          FAIL(
-              "unexpected memory type, expected to be allocated in " +
-              TRITONSERVER_MemoryTypeString(requested_memory_type) +
-              ", got " + TRITONSERVER_MemoryTypeString(memory_type.get()) +
-              ", id " + memory_type_id.get() + " for " + name);
-        }
+      // We make a copy of the data here... which we could avoid for
+      // performance reasons but ok for this sequence example.
+      BytePointer odata = new BytePointer(byte_size.get());
+      output_data.put(name, odata);
+      System.out.println(name + " is stored in system memory");
+      odata.put(base.limit(byte_size.get()));
+    }
 
-        // We make a copy of the data here... which we could avoid for
-        // performance reasons but ok for this sequence example.
-        BytePointer odata = new BytePointer(byte_size.get());
-        output_data.put(name, odata);
-        System.out.println(name + " is stored in system memory");
-        odata.put(base.limit(byte_size.get()));
+    int out = new IntPointer(output_data.get(output0)).get(0);
+    System.out.println("Value: " + out);
+    if (sequence_end) {
+      expected_result =
+          GetExpectedResult(model_name, expected_result, input_value, "end");
+      if (out != expected_result) {
+        FAIL("Expected result: " + expected_result + ", got " + out);
+      } else {
+        System.out.println(model_name + " test PASSED");
       }
-
-      int out = new IntPointer(output_data.get(output0)).get(0);
-      System.out.println("Value: " + out);
-      if(sequence_end){
-        expected_result = GetExpectedResult(model_name, expected_result,
-            input_value, "end");
-        if(out != expected_result){
-          FAIL("Expected result: " + expected_result + ", got " + out);
-        } else {
-          System.out.println(model_name + " test PASSED");
-        }
+    }
+  }
+
+  // Boilerplate main function to run inference
+  // for provided model, custom setting of
+  // sequence metadata
+  public static void main(String[] args) throws Exception
+  {
+    String model_repository_path = null;
+    String model_name = null;
+    int verbose_level = 0;
+
+    // Parse commandline...
+    for (int i = 0; i < args.length; i++) {
+      switch (args[i]) {
+        case "-m":
+          model_name = args[++i];
+          break;
+        case "-r":
+          model_repository_path = args[++i];
+          break;
+        case "-v":
+          verbose_level = 1;
+          break;
+        case "-?":
+          Usage(null);
+          break;
       }
     }
 
-    // Boilerplate main function to run inference
-    // for provided model, custom setting of
-    // sequence metadata
-    public static void
-    main(String[] args) throws Exception
-    {
-      String model_repository_path = null;
-      String model_name = null;
-      int verbose_level = 0;
-
-      // Parse commandline...
-      for (int i = 0; i < args.length; i++) {
-        switch (args[i]) {
-          case "-m":
-            model_name = args[++i];
-            break;
-          case "-r":
-            model_repository_path = args[++i];
-            break;
-          case "-v":
-            verbose_level = 1;
-            break;
-          case "-?":
-            Usage(null);
-            break;
-        }
-      }
+    if (model_name == null) {
+      Usage("-m must be used to specify model name");
+    }
+    if (model_repository_path == null) {
+      Usage("-r must be used to specify model repository path");
+    }
 
-      if(model_name == null) {
-        Usage("-m must be used to specify model name");
-      }
-      if (model_repository_path == null) {
-        Usage("-r must be used to specify model repository path");
-      }
+    // Check API version.
+    int[] api_version_major = {0}, api_version_minor = {0};
+    FAIL_IF_ERR(
+        TRITONSERVER_ApiVersion(api_version_major, api_version_minor),
+        "getting Triton API version");
+    if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major[0])
+        || (TRITONSERVER_API_VERSION_MINOR > api_version_minor[0])) {
+      FAIL("triton server API version mismatch");
+    }
 
-      // Check API version.
-      int[] api_version_major = {0}, api_version_minor = {0};
+    // Create the server...
+    TRITONSERVER_ServerOptions server_options =
+        new TRITONSERVER_ServerOptions(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsNew(server_options),
+        "creating server options");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetModelRepositoryPath(
+            server_options, model_repository_path),
+        "setting model repository path");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
+        "setting verbose logging level");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetBackendDirectory(
+            server_options, "/opt/tritonserver/backends"),
+        "setting backend directory");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
+            server_options, "/opt/tritonserver/repoagents"),
+        "setting repository agent directory");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
+        "setting strict model configuration");
+
+    TRITONSERVER_Server server_ptr = new TRITONSERVER_Server(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerNew(server_ptr, server_options), "creating server");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsDelete(server_options),
+        "deleting server options");
+
+    TRITONSERVER_ServerDeleter server =
+        new TRITONSERVER_ServerDeleter(server_ptr);
+
+    // Wait until the server is both live and ready.
+    int health_iters = 0;
+    while (true) {
+      boolean[] live = {false}, ready = {false};
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerIsLive(server, live),
+          "unable to get server liveness");
       FAIL_IF_ERR(
-          TRITONSERVER_ApiVersion(api_version_major, api_version_minor),
-          "getting Triton API version");
-      if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major[0]) ||
-          (TRITONSERVER_API_VERSION_MINOR > api_version_minor[0])) {
-        FAIL("triton server API version mismatch");
+          TRITONSERVER_ServerIsReady(server, ready),
+          "unable to get server readiness");
+      System.out.println(
+          "Server Health: live " + live[0] + ", ready " + ready[0]);
+      if (live[0] && ready[0]) {
+        break;
       }
 
-      // Create the server...
-      TRITONSERVER_ServerOptions server_options = new TRITONSERVER_ServerOptions(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsNew(server_options),
-          "creating server options");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetModelRepositoryPath(
-              server_options, model_repository_path),
-          "setting model repository path");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
-          "setting verbose logging level");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetBackendDirectory(
-              server_options, "/opt/tritonserver/backends"),
-          "setting backend directory");
+      if (++health_iters >= 10) {
+        FAIL("failed to find healthy inference server");
+      }
+
+      Thread.sleep(500);
+    }
+
+    // Print status of the server.
+    {
+      TRITONSERVER_Message server_metadata_message =
+          new TRITONSERVER_Message(null);
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
-              server_options, "/opt/tritonserver/repoagents"),
-          "setting repository agent directory");
+          TRITONSERVER_ServerMetadata(server, server_metadata_message),
+          "unable to get server metadata message");
+      BytePointer buffer = new BytePointer((Pointer) null);
+      SizeTPointer byte_size = new SizeTPointer(1);
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
-          "setting strict model configuration");
+          TRITONSERVER_MessageSerializeToJson(
+              server_metadata_message, buffer, byte_size),
+          "unable to serialize server metadata message");
+
+      System.out.println("Server Status:");
+      System.out.println(buffer.limit(byte_size.get()).getString());
 
-      TRITONSERVER_Server server_ptr = new TRITONSERVER_Server(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerNew(server_ptr, server_options), "creating server");
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsDelete(server_options),
-          "deleting server options");
-
-      TRITONSERVER_ServerDeleter server = new TRITONSERVER_ServerDeleter(server_ptr);
-
-      // Wait until the server is both live and ready.
-      int health_iters = 0;
-      while (true) {
-        boolean[] live = {false}, ready = {false};
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerIsLive(server, live),
-            "unable to get server liveness");
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerIsReady(server, ready),
-            "unable to get server readiness");
-        System.out.println("Server Health: live " + live[0] + ", ready " + ready[0]);
-        if (live[0] && ready[0]) {
-          break;
-        }
+          TRITONSERVER_MessageDelete(server_metadata_message),
+          "deleting status metadata");
+    }
 
+    // Wait for the model to become available.
+    boolean[] is_torch_model = {false};
+    boolean[] is_ready = {false};
+    health_iters = 0;
+    while (!is_ready[0]) {
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerModelIsReady(server, model_name, 1, is_ready),
+          "unable to get model readiness");
+      if (!is_ready[0]) {
         if (++health_iters >= 10) {
-          FAIL("failed to find healthy inference server");
+          FAIL("model failed to be ready in 10 iterations");
         }
-
         Thread.sleep(500);
+        continue;
       }
 
-      // Print status of the server.
-      {
-        TRITONSERVER_Message server_metadata_message = new TRITONSERVER_Message(null);
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerMetadata(server, server_metadata_message),
-            "unable to get server metadata message");
-        BytePointer buffer = new BytePointer((Pointer)null);
-        SizeTPointer byte_size = new SizeTPointer(1);
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageSerializeToJson(
-                server_metadata_message, buffer, byte_size),
-            "unable to serialize server metadata message");
-
-        System.out.println("Server Status:");
-        System.out.println(buffer.limit(byte_size.get()).getString());
-
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageDelete(server_metadata_message),
-            "deleting status metadata");
+      TRITONSERVER_Message model_metadata_message =
+          new TRITONSERVER_Message(null);
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerModelMetadata(
+              server, model_name, 1, model_metadata_message),
+          "unable to get model metadata message");
+      BytePointer buffer = new BytePointer((Pointer) null);
+      SizeTPointer byte_size = new SizeTPointer(1);
+      FAIL_IF_ERR(
+          TRITONSERVER_MessageSerializeToJson(
+              model_metadata_message, buffer, byte_size),
+          "unable to serialize model status protobuf");
+
+      JsonParser parser = new JsonParser();
+      JsonObject model_metadata = null;
+      try {
+        model_metadata = parser.parse(buffer.limit(byte_size.get()).getString())
+                             .getAsJsonObject();
+      }
+      catch (Exception e) {
+        FAIL("error: failed to parse model metadata from JSON: " + e);
       }
 
-      // Wait for the model to become available.
-      boolean[] is_torch_model = {false};
-      boolean[] is_ready = {false};
-      health_iters = 0;
-      while (!is_ready[0]) {
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerModelIsReady(
-                server, model_name, 1, is_ready),
-            "unable to get model readiness");
-        if (!is_ready[0]) {
-          if (++health_iters >= 10) {
-            FAIL("model failed to be ready in 10 iterations");
-          }
-          Thread.sleep(500);
-          continue;
-        }
-
-        TRITONSERVER_Message model_metadata_message = new TRITONSERVER_Message(null);
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerModelMetadata(
-                server, model_name, 1, model_metadata_message),
-            "unable to get model metadata message");
-        BytePointer buffer = new BytePointer((Pointer)null);
-        SizeTPointer byte_size = new SizeTPointer(1);
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageSerializeToJson(
-                model_metadata_message, buffer, byte_size),
-            "unable to serialize model status protobuf");
-
-        JsonParser parser = new JsonParser();
-        JsonObject model_metadata = null;
-        try {
-          model_metadata = parser.parse(buffer.limit(byte_size.get()).getString()).getAsJsonObject();
-        } catch (Exception e) {
-          FAIL("error: failed to parse model metadata from JSON: " + e);
-        }
-
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageDelete(model_metadata_message),
-            "deleting status protobuf");
+      FAIL_IF_ERR(
+          TRITONSERVER_MessageDelete(model_metadata_message),
+          "deleting status protobuf");
 
-        if (!model_metadata.get("name").getAsString().equals(model_name)) {
-          FAIL("unable to find metadata for model");
-        }
+      if (!model_metadata.get("name").getAsString().equals(model_name)) {
+        FAIL("unable to find metadata for model");
+      }
 
-        boolean found_version = false;
-        if (model_metadata.has("versions")) {
-          for (JsonElement version : model_metadata.get("versions").getAsJsonArray()) {
-            if (version.getAsString().equals("1")) {
-              found_version = true;
-              break;
-            }
+      boolean found_version = false;
+      if (model_metadata.has("versions")) {
+        for (JsonElement version :
+             model_metadata.get("versions").getAsJsonArray()) {
+          if (version.getAsString().equals("1")) {
+            found_version = true;
+            break;
           }
         }
-        if (!found_version) {
-          FAIL("unable to find version 1 status for model");
-        }
-
-        FAIL_IF_ERR(
-            ParseModelMetadata(model_metadata, is_torch_model),
-            "parsing model metadata");
+      }
+      if (!found_version) {
+        FAIL("unable to find version 1 status for model");
       }
 
-      // Create the allocator that will be used to allocate buffers for
-      // the result tensors.
-      TRITONSERVER_ResponseAllocator allocator = new TRITONSERVER_ResponseAllocator(null);
       FAIL_IF_ERR(
-          TRITONSERVER_ResponseAllocatorNew(
-              allocator, responseAlloc, responseRelease, null /* start_fn */),
-          "creating response allocator");
+          ParseModelMetadata(model_metadata, is_torch_model),
+          "parsing model metadata");
+    }
 
-      // Inference
-      TRITONSERVER_InferenceRequest irequest = new TRITONSERVER_InferenceRequest(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestNew(
-              irequest, server, model_name, -1 /* model_version */),
-          "creating inference request");
+    // Create the allocator that will be used to allocate buffers for
+    // the result tensors.
+    TRITONSERVER_ResponseAllocator allocator =
+        new TRITONSERVER_ResponseAllocator(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ResponseAllocatorNew(
+            allocator, responseAlloc, responseRelease, null /* start_fn */),
+        "creating response allocator");
+
+    // Inference
+    TRITONSERVER_InferenceRequest irequest =
+        new TRITONSERVER_InferenceRequest(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestNew(
+            irequest, server, model_name, -1 /* model_version */),
+        "creating inference request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
+        "setting ID for the request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetReleaseCallback(
+            irequest, inferRequestComplete, null /* request_release_userp */),
+        "setting request release callback");
+
+    // Inputs
+    String input0 = is_torch_model[0] ? "INPUT__0" : "INPUT";
+
+    long[] input0_shape = {1};
+
+    int datatype = TRITONSERVER_TYPE_INT32;
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddInput(
+            irequest, input0, datatype, input0_shape, input0_shape.length),
+        "setting input 0 meta-data for the request");
+
+    String output0 = is_torch_model[0] ? "OUTPUT__0" : "OUTPUT";
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output0),
+        "requesting output 0 for the request");
+
+    // Non-zero ID for the sequence requests
+    long correlation_id = 5;
+    // Number of requests in the sequence
+    int num_requests = 9;
+    // expected_result is 1 + 2 + 3 + ... + num_requests
+    int expected_result = num_requests * (1 + num_requests) / 2;
+    boolean sequence_start = true;
+    boolean sequence_end = false;
+
+    // Create the initial data for the input tensor.
+    IntPointer[] p0 = {new IntPointer(1)};
+    BytePointer input0_data = p0[0].getPointer(BytePointer.class);
+    long input0_size = input0_data.limit();
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAppendInputData(
+            irequest, input0, input0_data, input0_size, requested_memory_type,
+            0 /* memory_type_id */),
+        "assigning INPUT0 data");
+
+    for (int i = 0; i < num_requests; i++) {
+      // Update input value
+      int input = i + 1;
+      p0[0].put(0, input);
+
+      // Set sequence metadata
+      if (i == 1) {
+        sequence_start = false;
+      }
+      if (i == num_requests - 1) {
+        sequence_end = true;
+      }
+      SetSequenceMetadata(
+          irequest, correlation_id, sequence_start, sequence_end);
 
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
-          "setting ID for the request");
+      // Perform inference...
+      CompletableFuture<TRITONSERVER_InferenceResponse> completed =
+          new CompletableFuture<>();
+      futures.put(irequest, completed);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetReleaseCallback(
-              irequest, inferRequestComplete, null /* request_release_userp */),
-          "setting request release callback");
-
-      // Inputs
-      String input0 = is_torch_model[0] ? "INPUT__0" : "INPUT";
-
-      long[] input0_shape = {1};
-
-      int datatype = TRITONSERVER_TYPE_INT32;
+          TRITONSERVER_InferenceRequestSetResponseCallback(
+              irequest, allocator, null /* response_allocator_userp */,
+              inferResponseComplete, irequest),
+          "setting response callback");
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddInput(
-              irequest, input0, datatype, input0_shape, input0_shape.length),
-          "setting input 0 meta-data for the request");
+          TRITONSERVER_ServerInferAsync(server, irequest, null /* trace */),
+          "running inference");
 
-      String output0 = is_torch_model[0] ? "OUTPUT__0" : "OUTPUT";
+      // Wait for the inference to complete.
+      TRITONSERVER_InferenceResponse completed_response = completed.get();
+      futures.remove(irequest);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output0),
-          "requesting output 0 for the request");
-
-      // Non-zero ID for the sequence requests
-      long correlation_id = 5;
-      // Number of requests in the sequence
-      int num_requests = 9;
-      // Expected_result is  1+2+3+...+num_requests
-      int expected_result = num_requests * (1 + num_requests) / 2;
-      boolean sequence_start = true;
-      boolean sequence_end = false;
-
-      // Create the initial data for the input tensor.
-      IntPointer[] p0 = {new IntPointer(1)};
-      BytePointer input0_data = p0[0].getPointer(BytePointer.class);
-      long input0_size = input0_data.limit();
+          TRITONSERVER_InferenceResponseError(completed_response),
+          "response status");
 
-      FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestAppendInputData(
-                irequest, input0, input0_data, input0_size, requested_memory_type,
-                0 /* memory_type_id */),
-            "assigning INPUT0 data");
-
-      for(int i = 0; i < num_requests; i++) {
-        // Update input value
-        int input = i + 1;
-        p0[0].put(0, input);
-
-        // Set sequence metadata
-        if(i == 1) {
-          sequence_start = false;
-        }
-        if(i == num_requests - 1) {
-          sequence_end = true;
-        }
-        SetSequenceMetadata(irequest, correlation_id, sequence_start, sequence_end);
-        
-        // Perform inference...
-        CompletableFuture<TRITONSERVER_InferenceResponse> completed = new CompletableFuture<>();
-        futures.put(irequest, completed);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestSetResponseCallback(
-                irequest, allocator, null /* response_allocator_userp */,
-                inferResponseComplete, irequest),
-            "setting response callback");
-
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerInferAsync(
-                server, irequest, null /* trace */),
-            "running inference");
-
-        // Wait for the inference to complete.
-        TRITONSERVER_InferenceResponse completed_response = completed.get();
-        futures.remove(irequest);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseError(completed_response),
-            "response status");
-
-        Check(
-            model_name, completed_response, input, output0, input0_size,
-            datatype, sequence_end, expected_result);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseDelete(completed_response),
-            "deleting inference response");
-      }
+      Check(
+          model_name, completed_response, input, output0, input0_size, datatype,
+          sequence_end, expected_result);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestDelete(irequest),
-          "deleting inference request");
+          TRITONSERVER_InferenceResponseDelete(completed_response),
+          "deleting inference response");
+    }
 
-      FAIL_IF_ERR(
-          TRITONSERVER_ResponseAllocatorDelete(allocator),
-          "deleting response allocator");
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestDelete(irequest),
+        "deleting inference request");
 
-      System.exit(0);
-    }
+    FAIL_IF_ERR(
+        TRITONSERVER_ResponseAllocatorDelete(allocator),
+        "deleting response allocator");
+
+    System.exit(0);
+  }
 }
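
The Java sequence-batcher sample above drives one sequence of nine requests sharing correlation ID 5: the first request sets the sequence-start flag, the last sets sequence-end, and the accumulating model should return 1 + 2 + ... + 9 = num_requests * (num_requests + 1) / 2 = 45 on the final response. For reference, a minimal sketch of the same request pattern through the Python gRPC client is shown below; the model name, tensor names, and port are placeholders (assumptions, not taken from the test itself) and would have to match whatever sequence model is actually loaded.

```python
# Hypothetical sketch of the sequence pattern above via tritonclient.grpc.
# "simple_sequence", "INPUT", "OUTPUT", and the port are placeholder assumptions.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
model_name = "simple_sequence"  # placeholder for an accumulating sequence model
correlation_id = 5
num_requests = 9
expected_result = num_requests * (num_requests + 1) // 2  # 45

for i in range(num_requests):
    value = np.array([i + 1], dtype=np.int32)
    inp = grpcclient.InferInput("INPUT", [1], "INT32")
    inp.set_data_from_numpy(value)
    result = client.infer(
        model_name,
        [inp],
        sequence_id=correlation_id,
        sequence_start=(i == 0),
        sequence_end=(i == num_requests - 1),
    )
    output = result.as_numpy("OUTPUT")

# On the sequence-end response the accumulated value should match the closed form.
assert output[0] == expected_result
```
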
diff --git a/qa/L0_java_sequence_batcher/test.sh b/qa/L0_java_sequence_batcher/test.sh
index 1fe3a97fb2..2f988322d9 100755
--- a/qa/L0_java_sequence_batcher/test.sh
+++ b/qa/L0_java_sequence_batcher/test.sh
@@ -40,17 +40,15 @@ fi
 
 # Models
 DATADIR=/data/inferenceserver/${REPO_VERSION}
+JAVACPP_BRANCH=${JAVACPP_BRANCH:="https://github.com/bytedeco/javacpp-presets.git"}
+JAVACPP_BRANCH_TAG=${JAVACPP_BRANCH_TAG:="master"}
 
 # Set up test files based on installation instructions
 # https://github.com/bytedeco/javacpp-presets/blob/master/tritonserver/README.md
-set +e
-rm -r javacpp-presets
-git clone https://github.com/bytedeco/javacpp-presets.git
-cd javacpp-presets
-mvn clean install --projects .,tritonserver
-mvn clean install -f platform --projects ../tritonserver/platform -Djavacpp.platform.host
-cd ..
 set -e
+git clone --single-branch --depth=1 -b ${TRITON_CLIENT_REPO_TAG} https://github.com/triton-inference-server/client.git
+source client/src/java-api-bindings/scripts/install_dependencies_and_build.sh -b $PWD --javacpp-branch ${JAVACPP_BRANCH} --javacpp-tag ${JAVACPP_BRANCH_TAG} --keep-build-dependencies
+cd ..
 
 CLIENT_LOG="client.log"
 MODEL_REPO=`pwd`/models
diff --git a/qa/L0_java_simple_example/test.sh b/qa/L0_java_simple_example/test.sh
index b3a54d6a11..e9726edff4 100755
--- a/qa/L0_java_simple_example/test.sh
+++ b/qa/L0_java_simple_example/test.sh
@@ -37,14 +37,12 @@ if [ -z "$REPO_VERSION" ]; then
     exit 1
 fi
 
-set +e
-rm -r javacpp-presets
-git clone https://github.com/bytedeco/javacpp-presets.git
-cd javacpp-presets
-mvn clean install --projects .,tritonserver
-mvn clean install -f platform --projects ../tritonserver/platform -Djavacpp.platform.host
-cd ..
+JAVACPP_BRANCH=${JAVACPP_BRANCH:="https://github.com/bytedeco/javacpp-presets.git"}
+JAVACPP_BRANCH_TAG=${JAVACPP_BRANCH_TAG:="master"}
 set -e
+git clone --single-branch --depth=1 -b ${TRITON_CLIENT_REPO_TAG} https://github.com/triton-inference-server/client.git
+source client/src/java-api-bindings/scripts/install_dependencies_and_build.sh -b $PWD --javacpp-branch ${JAVACPP_BRANCH} --javacpp-tag ${JAVACPP_BRANCH_TAG} --keep-build-dependencies
+cd ..
 
 CLIENT_LOG="client_cpu_only.log"
 DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository
diff --git a/qa/L0_json/test.sh b/qa/L0_json/test.sh
new file mode 100755
index 0000000000..522e17aa95
--- /dev/null
+++ b/qa/L0_json/test.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+RET=0
+UNIT_TEST="./triton_json_test"
+TEST_LOG="./triton_json_test.log"
+$UNIT_TEST >> $TEST_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $TEST_LOG
+    echo -e "\n***\n*** Triton Json Unit Test Failed\n***"
+    RET=1
+fi
+
+if [ $RET -eq 0 ]; then
+  echo -e "\n***\n*** Test Passed\n***"
+else
+  echo -e "\n***\n*** Test FAILED\n***"
+fi
+
+exit $RET
diff --git a/qa/L0_large_payload/large_payload_test.py b/qa/L0_large_payload/large_payload_test.py
old mode 100644
new mode 100755
index 5ad0939a6f..fff57290ef
--- a/qa/L0_large_payload/large_payload_test.py
+++ b/qa/L0_large_payload/large_payload_test.py
@@ -1,4 +1,6 @@
-# Copyright 2019-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,19 +27,20 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
 import math
 import unittest
+
 import numpy as np
 import test_util as tu
 import tritongrpcclient as grpcclient
 import tritonhttpclient as httpclient
-from tritonclientutils import np_to_triton_dtype, InferenceServerException
+from tritonclientutils import InferenceServerException, np_to_triton_dtype
 
 
 class LargePayLoadTest(tu.TestResultCollector):
-
     def setUp(self):
         self._data_type = np.float32
 
@@ -45,36 +48,40 @@ def setUp(self):
         # hard limit on 2GBs for the size of input tensors. All backends except
         # plan backend should be able to handle payloads larger than 2GBs using
         # HTTP.
-        very_large_tensor_shape = (math.trunc(
-            3 * (1024 * 1024 * 1024) / np.dtype(self._data_type).itemsize),)
+        very_large_tensor_shape = (
+            math.trunc(3 * (1024 * 1024 * 1024) / np.dtype(self._data_type).itemsize),
+        )
         self._very_large_in0 = np.random.random(very_large_tensor_shape).astype(
-            self._data_type)
+            self._data_type
+        )
 
         # 1.9 GBs allows us to test gRPC with moderate sizes too.
-        large_tensor_shape = (math.trunc(1.9 * (1024 * 1024 * 1024) //
-                                         np.dtype(self._data_type).itemsize),)
-        self._large_in0 = np.random.random(large_tensor_shape).astype(
-            self._data_type)
+        large_tensor_shape = (
+            math.trunc(
+                1.9 * (1024 * 1024 * 1024) // np.dtype(self._data_type).itemsize
+            ),
+        )
+        self._large_in0 = np.random.random(large_tensor_shape).astype(self._data_type)
 
         small_tensor_shape = (1,)
-        self._small_in0 = np.random.random(small_tensor_shape).astype(
-            self._data_type)
-
-        self._clients = ((httpclient,
-                          httpclient.InferenceServerClient('localhost:8000')),
-                         (grpcclient,
-                          grpcclient.InferenceServerClient('localhost:8001')))
-
-    def _test_helper(self,
-                     client,
-                     model_name,
-                     input_name='INPUT0',
-                     output_name='OUTPUT0'):
-        # plan does not supoort large batch sizes.
-        if not model_name.startswith('plan'):
+        self._small_in0 = np.random.random(small_tensor_shape).astype(self._data_type)
+
+        self._clients = (
+            (httpclient, httpclient.InferenceServerClient("localhost:8000")),
+            (grpcclient, grpcclient.InferenceServerClient("localhost:8001")),
+        )
+
+    def _test_helper(
+        self, client, model_name, input_name="INPUT0", output_name="OUTPUT0"
+    ):
+        # plan does not support large batch sizes.
+        if not model_name.startswith("plan"):
             inputs = [
-                client[0].InferInput(input_name, self._large_in0.shape,
-                                     np_to_triton_dtype(self._data_type))
+                client[0].InferInput(
+                    input_name,
+                    self._large_in0.shape,
+                    np_to_triton_dtype(self._data_type),
+                )
             ]
             inputs[0].set_data_from_numpy(self._large_in0)
             results = client[1].infer(model_name, inputs)
@@ -83,13 +90,17 @@ def _test_helper(self,
             # the framework and protocol do support large payload
             self.assertTrue(
                 np.array_equal(self._large_in0, results.as_numpy(output_name)),
-                "output is different from input")
+                "output is different from input",
+            )
 
         if client[0] == httpclient:
             # FIXME HTTPServer cannot support large payloads. See DLIS-1776.
             inputs = [
-                client[0].InferInput(input_name, self._very_large_in0.shape,
-                                     np_to_triton_dtype(self._data_type))
+                client[0].InferInput(
+                    input_name,
+                    self._very_large_in0.shape,
+                    np_to_triton_dtype(self._data_type),
+                )
             ]
             inputs[0].set_data_from_numpy(self._very_large_in0)
             with self.assertRaises(InferenceServerException):
@@ -112,56 +123,54 @@ def _test_helper(self,
 
         # Send a small payload to verify if the server is still functional
         inputs = [
-            client[0].InferInput(input_name, self._small_in0.shape,
-                                 np_to_triton_dtype(self._data_type))
+            client[0].InferInput(
+                input_name, self._small_in0.shape, np_to_triton_dtype(self._data_type)
+            )
         ]
         inputs[0].set_data_from_numpy(self._small_in0)
         results = client[1].infer(model_name, inputs)
         self.assertTrue(
             np.array_equal(self._small_in0, results.as_numpy(output_name)),
-            "output is different from input")
+            "output is different from input",
+        )
 
     def test_graphdef(self):
         # graphdef_nobatch_zero_1_float32 is identity model with input shape [-1]
         for client in self._clients:
-            model_name = tu.get_zero_model_name("graphdef_nobatch", 1,
-                                                self._data_type)
+            model_name = tu.get_zero_model_name("graphdef_nobatch", 1, self._data_type)
             self._test_helper(client, model_name)
 
     def test_savedmodel(self):
         # savedmodel_nobatch_zero_1_float32 is identity model with input shape [-1]
         for client in self._clients:
-            model_name = tu.get_zero_model_name("savedmodel_nobatch", 1,
-                                                self._data_type)
+            model_name = tu.get_zero_model_name(
+                "savedmodel_nobatch", 1, self._data_type
+            )
             self._test_helper(client, model_name)
 
     def test_onnx(self):
         # onnx_nobatch_zero_1_float32 is identity model with input shape [-1]
         for client in self._clients:
-            model_name = tu.get_zero_model_name("onnx_nobatch", 1,
-                                                self._data_type)
+            model_name = tu.get_zero_model_name("onnx_nobatch", 1, self._data_type)
             self._test_helper(client, model_name)
 
     def test_python(self):
         # python_nobatch_zero_1_float32 is identity model with input shape [-1]
         for client in self._clients:
-            model_name = tu.get_zero_model_name("python_nobatch", 1,
-                                                self._data_type)
+            model_name = tu.get_zero_model_name("python_nobatch", 1, self._data_type)
             self._test_helper(client, model_name)
 
     def test_plan(self):
         # plan_nobatch_zero_1_float32 is identity model with input shape [-1]
         for client in self._clients:
-            model_name = tu.get_zero_model_name("plan_nobatch", 1,
-                                                self._data_type)
+            model_name = tu.get_zero_model_name("plan_nobatch", 1, self._data_type)
             self._test_helper(client, model_name)
 
     def test_libtorch(self):
         # libtorch_nobatch_zero_1_float32 is identity model with input shape [-1]
         for client in self._clients:
-            model_name = tu.get_zero_model_name("libtorch_nobatch", 1,
-                                                self._data_type)
-            self._test_helper(client, model_name, 'INPUT__0', 'OUTPUT__0')
+            model_name = tu.get_zero_model_name("libtorch_nobatch", 1, self._data_type)
+            self._test_helper(client, model_name, "INPUT__0", "OUTPUT__0")
 
     def test_custom(self):
         # custom_zero_1_float32 is identity model with input shape [-1]
@@ -170,5 +179,5 @@ def test_custom(self):
             self._test_helper(client, model_name)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
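
The reformatted setUp() above only reflows the size arithmetic, but the numbers are the point of the test: a float32 element is 4 bytes, so the "very large" tensor of about 3 GiB exceeds gRPC/protobuf's hard 2 GB message limit, while the 1.9 GiB tensor stays under it and should still work over gRPC. A minimal sketch of that arithmetic, using the same expressions as the test and nothing test-specific, is:

```python
# Minimal sketch of the payload-size arithmetic used in setUp() above.
import math
import numpy as np

itemsize = np.dtype(np.float32).itemsize           # 4 bytes per float32 element
very_large = math.trunc(3 * (1024**3) / itemsize)  # 805306368 elements, ~3 GiB
large = math.trunc(1.9 * (1024**3) // itemsize)    # 510027366 elements, ~1.9 GiB

print(very_large * itemsize)  # 3221225472 bytes: above the 2 GB gRPC/protobuf limit
print(large * itemsize)       # 2040109464 bytes: below the limit, so gRPC accepts it
```
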
diff --git a/qa/L0_large_payload/test.sh b/qa/L0_large_payload/test.sh
old mode 100644
new mode 100755
diff --git a/qa/L0_libtorch_inference_mode/test.sh b/qa/L0_libtorch_inference_mode/test.sh
old mode 100644
new mode 100755
index 5017f12769..85b4a49fae
--- a/qa/L0_libtorch_inference_mode/test.sh
+++ b/qa/L0_libtorch_inference_mode/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2021 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -73,7 +73,7 @@ for FLAG in true false; do
 
     set +e
 
-    python $SIMPLE_INFER_CLIENT_PY >> $CLIENT_LOG 2>&1
+    python $LIBTORCH_INFER_CLIENT_PY >> $CLIENT_LOG 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
diff --git a/qa/L0_libtorch_instance_group_kind_model/client.py b/qa/L0_libtorch_instance_group_kind_model/client.py
new file mode 100755
index 0000000000..92bead3464
--- /dev/null
+++ b/qa/L0_libtorch_instance_group_kind_model/client.py
@@ -0,0 +1,90 @@
+#!/usr/bin/env python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import os
+import sys
+
+sys.path.append("../common")
+
+import unittest
+
+import numpy as np
+import test_util as tu
+import tritonclient.http as httpclient
+
+# By default, find tritonserver on "localhost", but can be overridden
+# with TRITONSERVER_IPADDR envvar
+_tritonserver_ipaddr = os.environ.get("TRITONSERVER_IPADDR", "localhost")
+
+
+class InferTest(tu.TestResultCollector):
+    def test_infer(self):
+        try:
+            triton_client = httpclient.InferenceServerClient(
+                url=f"{_tritonserver_ipaddr}:8000"
+            )
+        except Exception as e:
+            print("channel creation failed: " + str(e))
+            sys.exit(1)
+
+        model_name = os.environ["MODEL_NAME"]
+
+        inputs = []
+        outputs = []
+        inputs.append(httpclient.InferInput("INPUT0", [1, 16], "FP32"))
+        inputs.append(httpclient.InferInput("INPUT1", [1, 16], "FP32"))
+
+        # Create the data for the two input tensors.
+        input0_data = np.arange(start=0, stop=16, dtype=np.float32)
+        input0_data = np.expand_dims(input0_data, axis=0)
+        input1_data = np.arange(start=32, stop=48, dtype=np.float32)
+        input1_data = np.expand_dims(input1_data, axis=0)
+
+        # Initialize the data
+        inputs[0].set_data_from_numpy(input0_data, binary_data=True)
+        inputs[1].set_data_from_numpy(input1_data, binary_data=True)
+
+        outputs.append(httpclient.InferRequestedOutput("OUTPUT__0", binary_data=True))
+        outputs.append(httpclient.InferRequestedOutput("OUTPUT__1", binary_data=True))
+
+        results = triton_client.infer(model_name, inputs, outputs=outputs)
+
+        output0_data = results.as_numpy("OUTPUT__0")
+        output1_data = results.as_numpy("OUTPUT__1")
+
+        expected_output_0 = input0_data + input1_data
+        expected_output_1 = input0_data - input1_data
+
+        self.assertEqual(output0_data.shape, (1, 16))
+        self.assertEqual(output1_data.shape, (1, 16))
+
+        self.assertTrue(np.all(expected_output_0 == output0_data))
+        self.assertTrue(np.all(expected_output_1 == output1_data))
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_libtorch_instance_group_kind_model/gen_models.py b/qa/L0_libtorch_instance_group_kind_model/gen_models.py
new file mode 100755
index 0000000000..e61980f491
--- /dev/null
+++ b/qa/L0_libtorch_instance_group_kind_model/gen_models.py
@@ -0,0 +1,90 @@
+#!/usr/bin/python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import torch
+import torch.nn as nn
+
+
+class SumModule(nn.Module):
+    def __init__(self, device):
+        super(SumModule, self).__init__()
+        self.device = device
+
+    def forward(self, INPUT0, INPUT1):
+        INPUT0 = INPUT0.to(self.device)
+        INPUT1 = INPUT1.to(self.device)
+        print(
+            "SumModule - INPUT0 device: {}, INPUT1 device: {}\n".format(
+                INPUT0.device, INPUT1.device
+            )
+        )
+        return INPUT0 + INPUT1
+
+
+class DiffModule(nn.Module):
+    def __init__(self, device):
+        super(DiffModule, self).__init__()
+        self.device = device
+
+    def forward(self, INPUT0, INPUT1):
+        INPUT0 = INPUT0.to(self.device)
+        INPUT1 = INPUT1.to(self.device)
+        print(
+            "DiffModule - INPUT0 device: {}, INPUT1 device: {}\n".format(
+                INPUT0.device, INPUT1.device
+            )
+        )
+        return INPUT0 - INPUT1
+
+
+class TestModel(nn.Module):
+    def __init__(self, device0, device1):
+        super(TestModel, self).__init__()
+        self.device0 = device0
+        self.device1 = device1
+
+        self.layer1 = SumModule(self.device0)
+        self.layer2 = DiffModule(self.device1)
+
+    def forward(self, INPUT0, INPUT1):
+        op0 = self.layer1(INPUT0, INPUT1)
+        op1 = self.layer2(INPUT0, INPUT1)
+        return op0, op1
+
+
+if torch.cuda.device_count() < 4:
+    print("Need at least 4 GPUs to run this test")
+    exit(1)
+
+devices = [("cuda:2", "cuda:0"), ("cpu", "cuda:3")]
+model_names = ["libtorch_multi_gpu", "libtorch_multi_device"]
+
+for device_pair, model_name in zip(devices, model_names):
+    model = TestModel(device_pair[0], device_pair[1])
+    model_path = "models/" + model_name + "/1/model.pt"
+    scripted_model = torch.jit.script(model)
+    scripted_model.save(model_path)
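
gen_models.py scripts two copies of TestModel: libtorch_multi_gpu places SumModule on cuda:2 and DiffModule on cuda:0, while libtorch_multi_device splits them across cpu and cuda:3; the device prints inside forward() are the lines the accompanying test.sh greps for in the server log. As a rough local smoke test (an assumption-laden sketch, not part of the test: it presumes gen_models.py has already written model.pt and that at least 4 GPUs are visible so cuda:3 exists), one could load the scripted artifact directly:

```python
# Hypothetical local check after running gen_models.py; assumes models/ exists
# and cuda:3 is a valid device on this machine.
import torch

model = torch.jit.load("models/libtorch_multi_device/1/model.pt")
in0 = torch.arange(0, 16, dtype=torch.float32).unsqueeze(0)
in1 = torch.arange(32, 48, dtype=torch.float32).unsqueeze(0)

# forward() prints the "SumModule - ..." / "DiffModule - ..." device lines
# that the test script later checks for in the Triton server log.
out0, out1 = model(in0, in1)
print(out0.device, out1.device)  # expected: cpu and cuda:3 for libtorch_multi_device
```
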
diff --git a/qa/L0_libtorch_instance_group_kind_model/models/libtorch_multi_device/config.pbtxt b/qa/L0_libtorch_instance_group_kind_model/models/libtorch_multi_device/config.pbtxt
new file mode 100644
index 0000000000..bf8ca0d649
--- /dev/null
+++ b/qa/L0_libtorch_instance_group_kind_model/models/libtorch_multi_device/config.pbtxt
@@ -0,0 +1,60 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+name: "libtorch_multi_device"
+platform: "pytorch_libtorch"
+max_batch_size: 8
+
+input [
+  {
+    name: "INPUT0"
+    data_type: TYPE_FP32
+    dims: [ 16 ]
+  },
+  {
+    name: "INPUT1"
+    data_type: TYPE_FP32
+    dims: [ 16 ]
+  }
+]
+output [
+  {
+    name: "OUTPUT__0"
+    data_type: TYPE_FP32
+    dims: [ 4 ]
+  },
+  {
+    name: "OUTPUT__1"
+    data_type: TYPE_FP32
+    dims: [ 4 ]
+  }
+]
+
+instance_group [
+  {
+    kind: KIND_MODEL
+  }
+]
diff --git a/qa/L0_libtorch_instance_group_kind_model/test.sh b/qa/L0_libtorch_instance_group_kind_model/test.sh
new file mode 100755
index 0000000000..04d76bd036
--- /dev/null
+++ b/qa/L0_libtorch_instance_group_kind_model/test.sh
@@ -0,0 +1,149 @@
+#!/bin/bash
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
+pip3 uninstall -y torch
+pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
+
+DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository
+SERVER=/opt/tritonserver/bin/tritonserver
+SERVER_ARGS="--model-repository=models --log-verbose=1"
+SERVER_LOG="./inference_server.log"
+
+CLIENT_PY=./client.py
+CLIENT_LOG="./client.log"
+EXPECTED_NUM_TESTS="1"
+TEST_RESULT_FILE='test_results.txt'
+
+source ../common/util.sh
+
+RET=0
+
+rm -f *.log *.txt
+
+mkdir -p models/libtorch_multi_device/1
+mkdir -p models/libtorch_multi_gpu/1
+cp models/libtorch_multi_device/config.pbtxt models/libtorch_multi_gpu/.
+(cd models/libtorch_multi_gpu && \
+    sed -i "s/name: \"libtorch_multi_device\"/name: \"libtorch_multi_gpu\"/" config.pbtxt)
+
+# Generate the models which are partitioned across multiple devices
+set +e
+python3 gen_models.py >> $CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Error when generating models. \n***"
+    cat $CLIENT_LOG
+    exit 1
+fi
+set -e
+
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+
+export MODEL_NAME='libtorch_multi_device'
+python3 $CLIENT_PY >> $CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Model $MODEL_NAME FAILED. \n***"
+    cat $CLIENT_LOG
+    RET=1
+else
+    check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Result Verification Failed\n***"
+        RET=1
+    fi
+fi
+
+MESSAGES=("SumModule - INPUT0 device: cpu, INPUT1 device: cpu"
+    "DiffModule - INPUT0 device: cuda:3, INPUT1 device: cuda:3")
+for MESSAGE in "${MESSAGES[@]}"; do
+    if grep -q "$MESSAGE" "$SERVER_LOG"; then
+        echo -e "Found \"$MESSAGE\"" >> "$CLIENT_LOG"
+    else
+        echo -e "Not found \"$MESSAGE\"" >> "$CLIENT_LOG"
+        RET=1
+    fi
+done
+
+export MODEL_NAME='libtorch_multi_gpu'
+python3 $CLIENT_PY >> $CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Model $MODEL_NAME FAILED. \n***"
+    cat $CLIENT_LOG
+    RET=1
+else
+    check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Result Verification Failed\n***"
+        RET=1
+    fi
+fi
+
+MESSAGES=("SumModule - INPUT0 device: cuda:2, INPUT1 device: cuda:2"
+    "DiffModule - INPUT0 device: cuda:0, INPUT1 device: cuda:0")
+for MESSAGE in "${MESSAGES[@]}"; do
+    if grep -q "$MESSAGE" "$SERVER_LOG"; then
+        echo -e "Found \"$MESSAGE\"" >> "$CLIENT_LOG"
+    else
+        echo -e "Not found \"$MESSAGE\"" >> "$CLIENT_LOG"
+        RET=1
+    fi
+done
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+if [ $RET -eq 0 ]; then
+    echo -e "\n***\n*** Test Passed\n***"
+else
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test FAILED\n***"
+fi
+
+exit $RET
diff --git a/qa/L0_libtorch_io_names/io_names_client.py b/qa/L0_libtorch_io_names/io_names_client.py
old mode 100644
new mode 100755
index 54cf972778..b74e520de2
--- a/qa/L0_libtorch_io_names/io_names_client.py
+++ b/qa/L0_libtorch_io_names/io_names_client.py
@@ -1,5 +1,5 @@
 #!/usr/bin/python
-# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -26,24 +26,22 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-from builtins import range
-from future.utils import iteritems
 import unittest
-import test_util as tu
-import numpy as np
+from builtins import range
 
+import numpy as np
+import test_util as tu
 import tritonclient.http as httpclient
-from tritonclient.utils import np_to_triton_dtype
-from tritonclient.utils import InferenceServerException
 
 
 class IONamingConvention(tu.TestResultCollector):
-
     def _infer_helper(self, model_name, io_names, reversed_order=False):
-        triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                         verbose=False)
+        triton_client = httpclient.InferenceServerClient(
+            "localhost:8000", verbose=False
+        )
 
         # Create the data for the two inputs. Initialize the first to unique
         # integers and the second to all ones.
@@ -55,30 +53,34 @@ def _infer_helper(self, model_name, io_names, reversed_order=False):
         output_req = []
         inputs.append(
             httpclient.InferInput(
-                io_names[0] if not reversed_order else io_names[1], [1, 16],
-                "FP32"))
+                io_names[0] if not reversed_order else io_names[1], [1, 16], "FP32"
+            )
+        )
         inputs[-1].set_data_from_numpy(input0_data)
         inputs.append(
             httpclient.InferInput(
-                io_names[1] if not reversed_order else io_names[0], [1, 16],
-                "FP32"))
+                io_names[1] if not reversed_order else io_names[0], [1, 16], "FP32"
+            )
+        )
         inputs[-1].set_data_from_numpy(input1_data)
         output_req.append(
-            httpclient.InferRequestedOutput(io_names[2], binary_data=True))
+            httpclient.InferRequestedOutput(io_names[2], binary_data=True)
+        )
         output_req.append(
-            httpclient.InferRequestedOutput(io_names[3], binary_data=True))
+            httpclient.InferRequestedOutput(io_names[3], binary_data=True)
+        )
 
         results = triton_client.infer(model_name, inputs, outputs=output_req)
 
         output0_data = results.as_numpy(
-            io_names[2] if not reversed_order else io_names[3])
+            io_names[2] if not reversed_order else io_names[3]
+        )
         output1_data = results.as_numpy(
-            io_names[3] if not reversed_order else io_names[2])
+            io_names[3] if not reversed_order else io_names[2]
+        )
         for i in range(16):
-            self.assertEqual(input0_data[0][i] - input1_data[0][i],
-                             output0_data[0][i])
-            self.assertEqual(input0_data[0][i] + input1_data[0][i],
-                             output1_data[0][i])
+            self.assertEqual(input0_data[0][i] - input1_data[0][i], output0_data[0][i])
+            self.assertEqual(input0_data[0][i] + input1_data[0][i], output1_data[0][i])
 
     def test_io_index(self):
         io_names = ["INPUT__0", "INPUT__1", "OUTPUT__0", "OUTPUT__1"]
@@ -110,10 +112,8 @@ def test_mix_arguments_index(self):
 
     def test_unordered_index(self):
         io_names = ["INPUT1", "INPUT0", "OUT__1", "OUT__0"]
-        self._infer_helper("libtorch_unordered_index",
-                           io_names,
-                           reversed_order=True)
+        self._infer_helper("libtorch_unordered_index", io_names, reversed_order=True)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_libtorch_io_names/test.sh b/qa/L0_libtorch_io_names/test.sh
old mode 100644
new mode 100755
diff --git a/qa/L0_libtorch_io_types/test.sh b/qa/L0_libtorch_io_types/test.sh
new file mode 100755
index 0000000000..ddd38810b6
--- /dev/null
+++ b/qa/L0_libtorch_io_types/test.sh
@@ -0,0 +1,131 @@
+#!/bin/bash
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
+export CUDA_VISIBLE_DEVICES=0
+
+SERVER=/opt/tritonserver/bin/tritonserver
+SERVER_ARGS="--model-repository=models"
+SERVER_LOG="./server.log"
+DATADIR=/data/inferenceserver/${REPO_VERSION}
+source ../common/util.sh
+
+# Test unsupported INPUT data type
+rm -rf models && mkdir -p models
+cp -r $DATADIR/qa_model_repository/libtorch_int32_int8_int8 models/libtorch_invalid_input_type && \
+    sed -i 's/libtorch_int32_int8_int8/libtorch_invalid_input_type/' models/libtorch_invalid_input_type/config.pbtxt && \
+    sed -i 's/TYPE_INT32/TYPE_UINT32/' models/libtorch_invalid_input_type/config.pbtxt
+
+rm -f *.log
+
+run_server
+if [ "$SERVER_PID" != "0" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Unexpected server start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    exit 1
+fi
+
+set +e
+grep "unsupported datatype TYPE_UINT32 for input 'INPUT0' for model 'libtorch_invalid_input_type'" $SERVER_LOG
+if [ $? -ne 0 ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Unsupported INPUT datatype not found in server log\n***"
+    exit 1
+fi
+set -e
+
+# Test unsupported OUTPUT data type
+rm -rf models && mkdir -p models
+cp -r $DATADIR/qa_model_repository/libtorch_int32_int8_int8 models/libtorch_invalid_output_type && \
+    sed -i 's/libtorch_int32_int8_int8/libtorch_invalid_output_type/' models/libtorch_invalid_output_type/config.pbtxt && \
+    sed -i 's/TYPE_INT8/TYPE_UINT64/' models/libtorch_invalid_output_type/config.pbtxt
+
+rm -f *.log
+
+run_server
+if [ "$SERVER_PID" != "0" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Unexpected server start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    exit 1
+fi
+
+set +e
+grep "unsupported datatype TYPE_UINT64 for output 'OUTPUT__0' for model 'libtorch_invalid_output_type'" $SERVER_LOG
+if [ $? -ne 0 ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Unsupported OUTPUT datatype not found in server log\n***"
+    exit 1
+fi
+set -e
+
+# Test unsupported sequence_batching data type
+rm -rf models && mkdir -p models
+cp -r $DATADIR/qa_variable_sequence_model_repository/libtorch_sequence_int32 models/libtorch_invalid_sequence_int32 && \
+    sed -i 's/libtorch_sequence_int32/libtorch_invalid_sequence_int32/' models/libtorch_invalid_sequence_int32/config.pbtxt && \
+    sed -i 's/READY__2/CORRID__2/' models/libtorch_invalid_sequence_int32/config.pbtxt && \
+    sed -i 's/CONTROL_SEQUENCE_READY/CONTROL_SEQUENCE_CORRID/' models/libtorch_invalid_sequence_int32/config.pbtxt && \
+    sed -i ':begin;$!N;s/CORRID\n\(.*\)int32_false_true: \[ 0, 1 \]/CORRID\ndata_type: TYPE_UINT32/' models/libtorch_invalid_sequence_int32/config.pbtxt
+
+rm -f *.log
+
+run_server
+if [ "$SERVER_PID" != "0" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Unexpected server start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    exit 1
+fi
+
+set +e
+grep "input 'CORRID__2' type 'TYPE_UINT32' is not supported by PyTorch." $SERVER_LOG
+if [ $? -ne 0 ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Unsupported sequence_batching datatype not found in server log\n***"
+    exit 1
+fi
+set -e
+
+# Test passed
+echo -e "\n***\n*** Test Passed\n***"
+exit 0
diff --git a/qa/L0_libtorch_nvfuser/test.sh b/qa/L0_libtorch_nvfuser/test.sh
old mode 100644
new mode 100755
index b4a31e9984..4614a66de1
--- a/qa/L0_libtorch_nvfuser/test.sh
+++ b/qa/L0_libtorch_nvfuser/test.sh
@@ -91,8 +91,7 @@ parameters: {
 
     NVFUSER_LOG="NvFuser is "
     if [ "$FLAG" == "true" ]; then
-        # NvFuser support has been disabled. Change to 'enabled' when fixed.
-        NVFUSER_LOG+="disabled"
+        NVFUSER_LOG+="enabled"
     elif [ "$FLAG" == "false" ]; then
         NVFUSER_LOG+="disabled"
     else
diff --git a/qa/L0_libtorch_optimized_execution/test.sh b/qa/L0_libtorch_optimized_execution/test.sh
old mode 100644
new mode 100755
diff --git a/qa/L0_libtorch_shared_weights/libtorch_shared_weights_test.py b/qa/L0_libtorch_shared_weights/libtorch_shared_weights_test.py
old mode 100644
new mode 100755
index 3f08b63962..7c2fdb5a71
--- a/qa/L0_libtorch_shared_weights/libtorch_shared_weights_test.py
+++ b/qa/L0_libtorch_shared_weights/libtorch_shared_weights_test.py
@@ -1,4 +1,6 @@
-# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -28,33 +30,29 @@
 
 sys.path.append("../common")
 
-import argparse
-import numpy as np
-import requests as httpreq
 import unittest
 from builtins import range
-import tritonhttpclient as httpclient
+
+import numpy as np
 import test_util as tu
+import tritonhttpclient as httpclient
 
 FLAGS = None
 
 
 class SharedWeightsTest(tu.TestResultCollector):
-
     def _full_exact(self, model_name, request_concurrency, shape):
-
         # Run async requests to make sure backend handles concurrent requests
         # correctly.
         client = httpclient.InferenceServerClient(
-            "localhost:8000", concurrency=request_concurrency)
+            "localhost:8000", concurrency=request_concurrency
+        )
         input_datas = []
         requests = []
         for i in range(request_concurrency):
             input_data = (16384 * np.random.randn(*shape)).astype(np.float32)
             input_datas.append(input_data)
-            inputs = [
-                httpclient.InferInput("INPUT__0", input_data.shape, "FP32")
-            ]
+            inputs = [httpclient.InferInput("INPUT__0", input_data.shape, "FP32")]
             inputs[0].set_data_from_numpy(input_data)
             requests.append(client.async_infer(model_name, inputs))
 
@@ -64,8 +62,7 @@ def _full_exact(self, model_name, request_concurrency, shape):
             results = requests[i].get_result()
 
             output_data = results.as_numpy("OUTPUT__0")
-            self.assertIsNotNone(output_data,
-                                 "error: expected 'OUTPUT__0' to be found")
+            self.assertIsNotNone(output_data, "error: expected 'OUTPUT__0' to be found")
             np.testing.assert_allclose(output_data, input_datas[i])
 
     def test_pytorch_identity_model(self):
@@ -73,5 +70,5 @@ def test_pytorch_identity_model(self):
         self._full_exact(model_name, 128, [8])
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_libtorch_shared_weights/test.sh b/qa/L0_libtorch_shared_weights/test.sh
old mode 100644
new mode 100755
index e6f23b7a45..6ca251ce32
--- a/qa/L0_libtorch_shared_weights/test.sh
+++ b/qa/L0_libtorch_shared_weights/test.sh
@@ -1,4 +1,5 @@
-# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/bin/bash
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
diff --git a/qa/L0_lifecycle/lifecycle_test.py b/qa/L0_lifecycle/lifecycle_test.py
old mode 100644
new mode 100755
index aaf5b033dc..ea2eecb20a
--- a/qa/L0_lifecycle/lifecycle_test.py
+++ b/qa/L0_lifecycle/lifecycle_test.py
@@ -1,4 +1,6 @@
-# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -28,92 +30,101 @@
 
 sys.path.append("../common")
 
-from builtins import range
-from functools import partial
+import base64
+import concurrent.futures
+import json
 import os
 import shutil
 import signal
+import threading
 import time
 import unittest
-import numpy as np
+from builtins import range
+from functools import partial
+
 import infer_util as iu
+import numpy as np
 import test_util as tu
-import threading
-
 import tritonclient.grpc as grpcclient
 import tritonclient.http as httpclient
 from tritonclient.utils import InferenceServerException
 
 
 class LifeCycleTest(tu.TestResultCollector):
-
-    def _infer_success_models(self,
-                              model_base_names,
-                              versions,
-                              tensor_shape,
-                              swap=False):
+    def _infer_success_models(
+        self, model_base_names, versions, tensor_shape, swap=False
+    ):
         for base_name in model_base_names:
             try:
-                model_name = tu.get_model_name(base_name, np.float32,
-                                               np.float32, np.float32)
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                model_name = tu.get_model_name(
+                    base_name, np.float32, np.float32, np.float32
+                )
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     # FIXME is_server_ready should be true here DLIS-1296
                     # self.assertTrue(triton_client.is_server_ready())
                     for v in versions:
                         self.assertTrue(
-                            triton_client.is_model_ready(model_name, str(v)))
+                            triton_client.is_model_ready(model_name, str(v))
+                        )
 
                 for v in versions:
-                    iu.infer_exact(self,
-                                   base_name,
-                                   tensor_shape,
-                                   1,
-                                   np.float32,
-                                   np.float32,
-                                   np.float32,
-                                   model_version=v,
-                                   swap=(swap or (v != 1)))
+                    iu.infer_exact(
+                        self,
+                        base_name,
+                        tensor_shape,
+                        1,
+                        np.float32,
+                        np.float32,
+                        np.float32,
+                        model_version=v,
+                        swap=(swap or (v != 1)),
+                    )
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
-    def _infer_success_identity(self, model_base, versions, tensor_dtype,
-                                tensor_shape):
+    def _infer_success_identity(self, model_base, versions, tensor_dtype, tensor_shape):
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertTrue(triton_client.is_server_ready())
             for v in versions:
                 self.assertTrue(
                     triton_client.is_model_ready(
-                        tu.get_zero_model_name(model_base, 1, tensor_dtype),
-                        str(v)))
+                        tu.get_zero_model_name(model_base, 1, tensor_dtype), str(v)
+                    )
+                )
 
             for v in versions:
-                iu.infer_zero(self,
-                              model_base,
-                              1,
-                              tensor_dtype,
-                              tensor_shape,
-                              tensor_shape,
-                              use_http=False,
-                              use_grpc=True,
-                              use_http_json_tensors=False,
-                              use_streaming=False)
+                iu.infer_zero(
+                    self,
+                    model_base,
+                    1,
+                    tensor_dtype,
+                    tensor_shape,
+                    tensor_shape,
+                    use_http=False,
+                    use_grpc=True,
+                    use_http_json_tensors=False,
+                    use_streaming=False,
+                )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
     def _get_client(self, use_grpc=False):
         if use_grpc:
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
         else:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
         return triton_client
 
     def _async_load(self, model_name, use_grpc):
@@ -129,8 +140,9 @@ def test_parse_error_noexit(self):
         # SERVER_FAILED_TO_INITIALIZE status.
         # Server is not live and not ready regardless of --strict-readiness
         try:
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             self.assertFalse(triton_client.is_server_live())
             self.assertFalse(triton_client.is_server_ready())
             md = triton_client.get_server_metadata()
@@ -140,13 +152,14 @@ def test_parse_error_noexit(self):
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             self.assertFalse(triton_client.is_server_live())
             self.assertFalse(triton_client.is_server_ready())
             md = triton_client.get_server_metadata()
-            self.assertEqual(os.environ["TRITON_SERVER_VERSION"], md['version'])
-            self.assertEqual("triton", md['name'])
+            self.assertEqual(os.environ["TRITON_SERVER_VERSION"], md["version"])
+            self.assertEqual("triton", md["name"])
         except InferenceServerException as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -156,17 +169,20 @@ def test_parse_error_modelfail(self):
 
         # Server was started but with a model that fails to load
         try:
-            model_name = tu.get_model_name('graphdef', np.float32, np.float32,
-                                           np.float32)
+            model_name = tu.get_model_name(
+                "graphdef", np.float32, np.float32, np.float32
+            )
 
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertFalse(triton_client.is_server_ready())
             self.assertFalse(triton_client.is_model_ready(model_name, "1"))
 
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertFalse(triton_client.is_server_ready())
             self.assertFalse(triton_client.is_model_ready(model_name, "1"))
@@ -175,35 +191,38 @@ def test_parse_error_modelfail(self):
 
         # Inferencing with the missing model should fail.
         try:
-            iu.infer_exact(self, 'graphdef', tensor_shape, 1, np.float32,
-                           np.float32, np.float32)
-            self.assertTrue(
-                False, "expected error for unavailable model " + model_name)
+            iu.infer_exact(
+                self, "graphdef", tensor_shape, 1, np.float32, np.float32, np.float32
+            )
+            self.assertTrue(False, "expected error for unavailable model " + model_name)
         except Exception as ex:
             self.assertIn(
                 "Request for unknown model: 'graphdef_float32_float32_float32' has no available versions",
-                ex.message())
+                ex.message(),
+            )
 
         # And other models should be loaded successfully
         try:
-            for base_name in ['savedmodel', 'onnx']:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
-                    model_name = tu.get_model_name(base_name, np.float32,
-                                                   np.float32, np.float32)
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "1"))
-
-                iu.infer_exact(self,
-                               base_name,
-                               tensor_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               model_version=1)
+            for base_name in ["savedmodel", "onnx"]:
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
+                    model_name = tu.get_model_name(
+                        base_name, np.float32, np.float32, np.float32
+                    )
+                    self.assertTrue(triton_client.is_model_ready(model_name, "1"))
+
+                iu.infer_exact(
+                    self,
+                    base_name,
+                    tensor_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    model_version=1,
+                )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -213,17 +232,20 @@ def test_parse_error_modelfail_nostrict(self):
 
         # Server was started but with a model that fails to load
         try:
-            model_name = tu.get_model_name('graphdef', np.float32, np.float32,
-                                           np.float32)
+            model_name = tu.get_model_name(
+                "graphdef", np.float32, np.float32, np.float32
+            )
 
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertTrue(triton_client.is_server_ready())
             self.assertFalse(triton_client.is_model_ready(model_name, "1"))
 
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertTrue(triton_client.is_server_ready())
             self.assertFalse(triton_client.is_model_ready(model_name, "1"))
@@ -232,35 +254,38 @@ def test_parse_error_modelfail_nostrict(self):
 
         # Inferencing with the missing model should fail.
         try:
-            iu.infer_exact(self, 'graphdef', tensor_shape, 1, np.float32,
-                           np.float32, np.float32)
-            self.assertTrue(
-                False, "expected error for unavailable model " + model_name)
+            iu.infer_exact(
+                self, "graphdef", tensor_shape, 1, np.float32, np.float32, np.float32
+            )
+            self.assertTrue(False, "expected error for unavailable model " + model_name)
         except Exception as ex:
             self.assertIn(
                 "Request for unknown model: 'graphdef_float32_float32_float32' has no available versions",
-                ex.message())
+                ex.message(),
+            )
 
         # And other models should be loaded successfully
         try:
-            for base_name in ['savedmodel', 'onnx']:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
-                    model_name = tu.get_model_name(base_name, np.float32,
-                                                   np.float32, np.float32)
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "1"))
-
-                iu.infer_exact(self,
-                               base_name,
-                               tensor_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               model_version=1)
+            for base_name in ["savedmodel", "onnx"]:
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
+                    model_name = tu.get_model_name(
+                        base_name, np.float32, np.float32, np.float32
+                    )
+                    self.assertTrue(triton_client.is_model_ready(model_name, "1"))
+
+                iu.infer_exact(
+                    self,
+                    base_name,
+                    tensor_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    model_version=1,
+                )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -268,13 +293,14 @@ def test_parse_error_no_model_config(self):
         tensor_shape = (1, 16)
 
         # Server was started but with a model that fails to be polled
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
-                model_name = tu.get_model_name('graphdef', np.float32,
-                                               np.float32, np.float32)
+                model_name = tu.get_model_name(
+                    "graphdef", np.float32, np.float32, np.float32
+                )
 
                 # expecting ready because not strict readiness
                 self.assertTrue(triton_client.is_server_live())
@@ -282,29 +308,36 @@ def test_parse_error_no_model_config(self):
 
                 md = triton_client.get_model_metadata(model_name, "1")
                 self.assertTrue(
-                    False, "expected model '" + model_name +
-                    "' to be ignored due to polling failure")
+                    False,
+                    "expected model '"
+                    + model_name
+                    + "' to be ignored due to polling failure",
+                )
 
             except Exception as ex:
                 self.assertIn(
                     "Request for unknown model: 'graphdef_float32_float32_float32' is not found",
-                    ex.message())
+                    ex.message(),
+                )
 
         # And other models should be loaded successfully
         try:
-            for base_name in ['savedmodel', 'onnx']:
-                model_name = tu.get_model_name(base_name, np.float32,
-                                               np.float32, np.float32)
+            for base_name in ["savedmodel", "onnx"]:
+                model_name = tu.get_model_name(
+                    base_name, np.float32, np.float32, np.float32
+                )
                 self.assertTrue(triton_client.is_model_ready(model_name, "1"))
 
-                iu.infer_exact(self,
-                               base_name,
-                               tensor_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               model_version=1)
+                iu.infer_exact(
+                    self,
+                    base_name,
+                    tensor_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    model_version=1,
+                )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -312,10 +345,10 @@ def test_init_error_modelfail(self):
         # --strict-readiness=true so server is live but not ready
 
         # Server was started but with models that fail to load
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 self.assertTrue(triton_client.is_server_live())
                 self.assertFalse(triton_client.is_server_ready())
@@ -330,24 +363,27 @@ def test_init_error_modelfail(self):
 
             # And other models should be loaded successfully
             try:
-                for base_name in ['graphdef', 'savedmodel', 'onnx']:
-                    model_name = tu.get_model_name(base_name, np.float32,
-                                                   np.float32, np.float32)
+                for base_name in ["graphdef", "savedmodel", "onnx"]:
+                    model_name = tu.get_model_name(
+                        base_name, np.float32, np.float32, np.float32
+                    )
                     self.assertTrue(triton_client.is_model_ready(model_name))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         try:
             tensor_shape = (1, 16)
-            for base_name in ['graphdef', 'savedmodel', 'onnx']:
-                iu.infer_exact(self,
-                               base_name,
-                               tensor_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               model_version=1)
+            for base_name in ["graphdef", "savedmodel", "onnx"]:
+                iu.infer_exact(
+                    self,
+                    base_name,
+                    tensor_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    model_version=1,
+                )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -356,95 +392,105 @@ def test_parse_error_model_no_version(self):
         tensor_shape = (1, 16)
 
         # Server was started but with a model that fails to load
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 self.assertTrue(triton_client.is_server_live())
                 self.assertFalse(triton_client.is_server_ready())
 
-                model_name = tu.get_model_name('graphdef', np.float32,
-                                               np.float32, np.float32)
+                model_name = tu.get_model_name(
+                    "graphdef", np.float32, np.float32, np.float32
+                )
                 self.assertFalse(triton_client.is_model_ready(model_name))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
             # Sanity check that other models are loaded properly
             try:
-                for base_name in ['savedmodel', 'onnx']:
-                    model_name = tu.get_model_name(base_name, np.float32,
-                                                   np.float32, np.float32)
+                for base_name in ["savedmodel", "onnx"]:
+                    model_name = tu.get_model_name(
+                        base_name, np.float32, np.float32, np.float32
+                    )
                     self.assertTrue(triton_client.is_model_ready(model_name))
                 for version in ["1", "3"]:
-                    model_name = tu.get_model_name("plan", np.float32,
-                                                   np.float32, np.float32)
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, version))
+                    model_name = tu.get_model_name(
+                        "plan", np.float32, np.float32, np.float32
+                    )
+                    self.assertTrue(triton_client.is_model_ready(model_name, version))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         try:
-            for base_name in ['savedmodel', 'onnx']:
-                iu.infer_exact(self,
-                               base_name,
-                               tensor_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               swap=True)
+            for base_name in ["savedmodel", "onnx"]:
+                iu.infer_exact(
+                    self,
+                    base_name,
+                    tensor_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    swap=True,
+                )
             for version in [1, 3]:
-                iu.infer_exact(self,
-                               'plan',
-                               tensor_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               swap=(version == 3),
-                               model_version=version)
+                iu.infer_exact(
+                    self,
+                    "plan",
+                    tensor_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    swap=(version == 3),
+                    model_version=version,
+                )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         try:
-            iu.infer_exact(self, 'graphdef', tensor_shape, 1, np.float32,
-                           np.float32, np.float32)
-            self.assertTrue(
-                False, "expected error for unavailable model " + model_name)
+            iu.infer_exact(
+                self, "graphdef", tensor_shape, 1, np.float32, np.float32, np.float32
+            )
+            self.assertTrue(False, "expected error for unavailable model " + model_name)
         except Exception as ex:
             self.assertIn(
                 "Request for unknown model: 'graphdef_float32_float32_float32' has no available versions",
-                ex.message())
+                ex.message(),
+            )
 
     def test_parse_ignore_zero_prefixed_version(self):
         tensor_shape = (1, 16)
 
         # Server was started but only version 1 is loaded
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
 
-                model_name = tu.get_model_name('savedmodel', np.float32,
-                                               np.float32, np.float32)
+                model_name = tu.get_model_name(
+                    "savedmodel", np.float32, np.float32, np.float32
+                )
                 self.assertTrue(triton_client.is_model_ready(model_name, "1"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         try:
             # swap=False for version 1
-            iu.infer_exact(self,
-                           'savedmodel',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=False)
+            iu.infer_exact(
+                self,
+                "savedmodel",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=False,
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -452,53 +498,54 @@ def test_parse_ignore_non_intergral_version(self):
         tensor_shape = (1, 16)
 
         # Server was started but only version 1 is loaded
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
 
-                model_name = tu.get_model_name('savedmodel', np.float32,
-                                               np.float32, np.float32)
+                model_name = tu.get_model_name(
+                    "savedmodel", np.float32, np.float32, np.float32
+                )
                 self.assertTrue(triton_client.is_model_ready(model_name, "1"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         try:
             # swap=False for version 1
-            iu.infer_exact(self,
-                           'savedmodel',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=False)
+            iu.infer_exact(
+                self,
+                "savedmodel",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=False,
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
     def test_dynamic_model_load_unload(self):
         tensor_shape = (1, 16)
-        savedmodel_name = tu.get_model_name('savedmodel', np.float32,
-                                            np.float32, np.float32)
-        onnx_name = tu.get_model_name('onnx', np.float32, np.float32,
-                                      np.float32)
+        savedmodel_name = tu.get_model_name(
+            "savedmodel", np.float32, np.float32, np.float32
+        )
+        onnx_name = tu.get_model_name("onnx", np.float32, np.float32, np.float32)
 
         # Make sure savedmodel model is not in the status (because
         # initially it is not in the model repository)
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
             except Exception as ex:
@@ -509,16 +556,14 @@ def test_dynamic_model_load_unload(self):
         try:
             shutil.copytree(savedmodel_name, "models/" + savedmodel_name)
             time.sleep(5)  # wait for model to load
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertTrue(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertTrue(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
         except Exception as ex:
@@ -526,47 +571,58 @@ def test_dynamic_model_load_unload(self):
 
         # Run inference on the just loaded model
         try:
-            iu.infer_exact(self,
-                           'savedmodel',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=True)
+            iu.infer_exact(
+                self,
+                "savedmodel",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=True,
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Make sure savedmodel has execution stats
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             stats = triton_client.get_inference_statistics(savedmodel_name)
             self.assertEqual(len(stats["model_stats"]), 2)
             for idx in range(len(stats["model_stats"])):
-                self.assertEqual(stats["model_stats"][idx]["name"],
-                                 savedmodel_name)
+                self.assertEqual(stats["model_stats"][idx]["name"], savedmodel_name)
                 if stats["model_stats"][idx]["version"] == "1":
                     self.assertEqual(
-                        stats["model_stats"][idx]["inference_stats"]["success"]
-                        ["count"], 0)
+                        stats["model_stats"][idx]["inference_stats"]["success"][
+                            "count"
+                        ],
+                        0,
+                    )
                 else:
                     self.assertNotEqual(
-                        stats["model_stats"][idx]["inference_stats"]["success"]
-                        ["count"], 0)
-
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+                        stats["model_stats"][idx]["inference_stats"]["success"][
+                            "count"
+                        ],
+                        0,
+                    )
+
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             stats = triton_client.get_inference_statistics(savedmodel_name)
             self.assertEqual(len(stats.model_stats), 2)
             for idx in range(len(stats.model_stats)):
                 self.assertEqual(stats.model_stats[idx].name, savedmodel_name)
                 if stats.model_stats[idx].version == "1":
                     self.assertEqual(
-                        stats.model_stats[idx].inference_stats.success.count, 0)
+                        stats.model_stats[idx].inference_stats.success.count, 0
+                    )
                 else:
                     self.assertNotEqual(
-                        stats.model_stats[idx].inference_stats.success.count, 0)
+                        stats.model_stats[idx].inference_stats.success.count, 0
+                    )
 
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
@@ -576,16 +632,14 @@ def test_dynamic_model_load_unload(self):
         try:
             shutil.rmtree("models/" + savedmodel_name)
             time.sleep(5)  # wait for model to unload
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
         except Exception as ex:
@@ -593,62 +647,65 @@ def test_dynamic_model_load_unload(self):
 
         # Model is removed so inference should fail
         try:
-            iu.infer_exact(self,
-                           'savedmodel',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=True)
+            iu.infer_exact(
+                self,
+                "savedmodel",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=True,
+            )
             self.assertTrue(
-                False,
-                "expected error for unavailable model " + savedmodel_name)
+                False, "expected error for unavailable model " + savedmodel_name
+            )
         except Exception as ex:
             self.assertIn(
-                "Request for unknown model: '{}' has no available versions".
-                format(savedmodel_name), ex.message())
+                "Request for unknown model: '{}' has no available versions".format(
+                    savedmodel_name
+                ),
+                ex.message(),
+            )
 
         # Add back the same model. The status/stats should be reset.
         try:
             shutil.copytree(savedmodel_name, "models/" + savedmodel_name)
             time.sleep(5)  # wait for model to load
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertTrue(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertTrue(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
 
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             stats = triton_client.get_inference_statistics(savedmodel_name)
             self.assertEqual(len(stats["model_stats"]), 2)
             self.assertEqual(stats["model_stats"][0]["name"], savedmodel_name)
             self.assertEqual(stats["model_stats"][1]["name"], savedmodel_name)
             self.assertEqual(
-                stats["model_stats"][0]["inference_stats"]["success"]["count"],
-                0)
+                stats["model_stats"][0]["inference_stats"]["success"]["count"], 0
+            )
             self.assertEqual(
-                stats["model_stats"][1]["inference_stats"]["success"]["count"],
-                0)
+                stats["model_stats"][1]["inference_stats"]["success"]["count"], 0
+            )
 
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             stats = triton_client.get_inference_statistics(savedmodel_name)
             self.assertEqual(len(stats.model_stats), 2)
             self.assertEqual(stats.model_stats[0].name, savedmodel_name)
             self.assertEqual(stats.model_stats[1].name, savedmodel_name)
-            self.assertEqual(stats.model_stats[0].inference_stats.success.count,
-                             0)
-            self.assertEqual(stats.model_stats[1].inference_stats.success.count,
-                             0)
+            self.assertEqual(stats.model_stats[0].inference_stats.success.count, 0)
+            self.assertEqual(stats.model_stats[1].inference_stats.success.count, 0)
 
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
@@ -658,16 +715,14 @@ def test_dynamic_model_load_unload(self):
         try:
             shutil.rmtree("models/" + onnx_name)
             time.sleep(5)  # wait for model to unload
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertTrue(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertTrue(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertFalse(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertFalse(triton_client.is_model_ready(onnx_name, "3"))
         except Exception as ex:
@@ -675,41 +730,41 @@ def test_dynamic_model_load_unload(self):
 
         # Model is removed so inference should fail
         try:
-            iu.infer_exact(self,
-                           'onnx',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=True)
-            self.assertTrue(False,
-                            "expected error for unavailable model " + onnx_name)
+            iu.infer_exact(
+                self,
+                "onnx",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=True,
+            )
+            self.assertTrue(False, "expected error for unavailable model " + onnx_name)
         except Exception as ex:
             self.assertIn(
                 "Request for unknown model: 'onnx_float32_float32_float32' has no available versions",
-                ex.message())
+                ex.message(),
+            )
 
     def test_dynamic_model_load_unload_disabled(self):
         tensor_shape = (1, 16)
-        savedmodel_name = tu.get_model_name('savedmodel', np.float32,
-                                            np.float32, np.float32)
-        onnx_name = tu.get_model_name('onnx', np.float32, np.float32,
-                                      np.float32)
+        savedmodel_name = tu.get_model_name(
+            "savedmodel", np.float32, np.float32, np.float32
+        )
+        onnx_name = tu.get_model_name("onnx", np.float32, np.float32, np.float32)
 
         # Make sure savedmodel model is not in the status (because
         # initially it is not in the model repository)
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
             except Exception as ex:
@@ -720,16 +775,14 @@ def test_dynamic_model_load_unload_disabled(self):
         try:
             shutil.copytree(savedmodel_name, "models/" + savedmodel_name)
             time.sleep(5)  # wait for model to load
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
         except Exception as ex:
@@ -737,37 +790,38 @@ def test_dynamic_model_load_unload_disabled(self):
 
         # Run inference which should fail because the model isn't there
         try:
-            iu.infer_exact(self,
-                           'savedmodel',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=True)
+            iu.infer_exact(
+                self,
+                "savedmodel",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=True,
+            )
             self.assertTrue(
-                False,
-                "expected error for unavailable model " + savedmodel_name)
+                False, "expected error for unavailable model " + savedmodel_name
+            )
         except Exception as ex:
             self.assertIn(
                 "Request for unknown model: 'savedmodel_float32_float32_float32' is not found",
-                ex.message())
+                ex.message(),
+            )
 
         # Remove one of the original models from the model repository.
         # Unloading is disabled so it should remain available in the status.
         try:
             shutil.rmtree("models/" + onnx_name)
             time.sleep(5)  # wait for model to unload (but it shouldn't)
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
         except Exception as ex:
@@ -776,84 +830,93 @@ def test_dynamic_model_load_unload_disabled(self):
         # Run inference to make sure model still being served even
         # though deleted from model repository
         try:
-            iu.infer_exact(self,
-                           'onnx',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=True)
+            iu.infer_exact(
+                self,
+                "onnx",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=True,
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
     def test_dynamic_version_load_unload(self):
         tensor_shape = (1, 16)
-        graphdef_name = tu.get_model_name('graphdef', np.int32, np.int32,
-                                          np.int32)
+        graphdef_name = tu.get_model_name("graphdef", np.int32, np.int32, np.int32)
 
         # There are 3 versions. Make sure that all have status and are
         # ready.
         try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "1"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "2"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "3"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "2"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "3"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Run inference on version 1 to make sure it is available
         try:
-            iu.infer_exact(self,
-                           'graphdef',
-                           tensor_shape,
-                           1,
-                           np.int32,
-                           np.int32,
-                           np.int32,
-                           swap=False,
-                           model_version=1)
+            iu.infer_exact(
+                self,
+                "graphdef",
+                tensor_shape,
+                1,
+                np.int32,
+                np.int32,
+                np.int32,
+                swap=False,
+                model_version=1,
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Make sure only version 1 has execution stats in the status.
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             stats = triton_client.get_inference_statistics(graphdef_name)
             self.assertEqual(len(stats["model_stats"]), 3)
             for idx in range(len(stats["model_stats"])):
-                self.assertEqual(stats["model_stats"][idx]["name"],
-                                 graphdef_name)
+                self.assertEqual(stats["model_stats"][idx]["name"], graphdef_name)
                 if stats["model_stats"][idx]["version"] == "1":
                     self.assertNotEqual(
-                        stats["model_stats"][idx]["inference_stats"]["success"]
-                        ["count"], 0)
+                        stats["model_stats"][idx]["inference_stats"]["success"][
+                            "count"
+                        ],
+                        0,
+                    )
                 else:
                     self.assertEqual(
-                        stats["model_stats"][idx]["inference_stats"]["success"]
-                        ["count"], 0)
-
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+                        stats["model_stats"][idx]["inference_stats"]["success"][
+                            "count"
+                        ],
+                        0,
+                    )
+
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             stats = triton_client.get_inference_statistics(graphdef_name)
             self.assertEqual(len(stats.model_stats), 3)
             for idx in range(len(stats.model_stats)):
                 self.assertEqual(stats.model_stats[idx].name, graphdef_name)
                 if stats.model_stats[idx].version == "1":
                     self.assertNotEqual(
-                        stats.model_stats[idx].inference_stats.success.count, 0)
+                        stats.model_stats[idx].inference_stats.success.count, 0
+                    )
                 else:
                     self.assertEqual(
-                        stats.model_stats[idx].inference_stats.success.count, 0)
+                        stats.model_stats[idx].inference_stats.success.count, 0
+                    )
 
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
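
The statistics checks above iterate over the per-version entries returned by
get_inference_statistics. A minimal standalone sketch of the same query, assuming a
server on localhost:8000 and a hypothetical model name:

    import tritonclient.http as httpclient

    model_name = "graphdef_int32_int32_int32"  # hypothetical; any loaded model works

    client = httpclient.InferenceServerClient("localhost:8000")
    stats = client.get_inference_statistics(model_name)

    # Each "model_stats" entry describes one loaded version; only versions that
    # actually served requests report a non-zero success count.
    for entry in stats["model_stats"]:
        count = entry["inference_stats"]["success"]["count"]
        print(entry["name"], entry["version"], count)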
@@ -863,87 +926,81 @@ def test_dynamic_version_load_unload(self):
         try:
             shutil.rmtree("models/" + graphdef_name + "/1")
             time.sleep(5)  # wait for version to unload
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(graphdef_name, "1"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "2"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(graphdef_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "2"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "3"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Version is removed so inference should fail
         try:
-            iu.infer_exact(self,
-                           'graphdef',
-                           tensor_shape,
-                           1,
-                           np.int32,
-                           np.int32,
-                           np.int32,
-                           swap=False,
-                           model_version=1)
+            iu.infer_exact(
+                self,
+                "graphdef",
+                tensor_shape,
+                1,
+                np.int32,
+                np.int32,
+                np.int32,
+                swap=False,
+                model_version=1,
+            )
             self.assertTrue(
-                False, "expected error for unavailable model " + graphdef_name)
+                False, "expected error for unavailable model " + graphdef_name
+            )
         except Exception as ex:
             self.assertIn(
                 "Request for unknown model: 'graphdef_int32_int32_int32' version 1 is not at ready state",
-                ex.message())
+                ex.message(),
+            )
 
         # Add another version to the model repository.
         try:
-            shutil.copytree("models/" + graphdef_name + "/2",
-                            "models/" + graphdef_name + "/7")
+            shutil.copytree(
+                "models/" + graphdef_name + "/2", "models/" + graphdef_name + "/7"
+            )
             time.sleep(5)  # wait for version to load
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(graphdef_name, "1"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "2"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "3"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "7"))
+                self.assertFalse(triton_client.is_model_ready(graphdef_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "2"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "3"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "7"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
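
In polling mode a new version is added simply by copying a version directory into the
repository and waiting for the next poll, which is what the test above exercises. A
condensed sketch of that flow, assuming a local models/ repository and the gRPC endpoint
used in these tests:

    import shutil
    import time

    import tritonclient.grpc as grpcclient

    model_name = "graphdef_int32_int32_int32"  # hypothetical model name

    # Copy an existing version directory to create a new version 7.
    shutil.copytree("models/" + model_name + "/2", "models/" + model_name + "/7")
    time.sleep(5)  # give the repository poller time to pick up the new version

    client = grpcclient.InferenceServerClient("localhost:8001")
    print(client.is_model_ready(model_name, "7"))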
 
     def test_dynamic_version_load_unload_disabled(self):
         tensor_shape = (1, 16)
-        graphdef_name = tu.get_model_name('graphdef', np.int32, np.int32,
-                                          np.int32)
+        graphdef_name = tu.get_model_name("graphdef", np.int32, np.int32, np.int32)
 
         # Add a new version to the model repository and give it time to
         # load. But it shouldn't load because dynamic loading is
         # disabled.
         try:
-            shutil.copytree("models/" + graphdef_name + "/2",
-                            "models/" + graphdef_name + "/7")
+            shutil.copytree(
+                "models/" + graphdef_name + "/2", "models/" + graphdef_name + "/7"
+            )
             time.sleep(5)  # wait for model to load
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "1"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "2"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "3"))
-                self.assertFalse(
-                    triton_client.is_model_ready(graphdef_name, "7"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "2"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(graphdef_name, "7"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -953,59 +1010,54 @@ def test_dynamic_version_load_unload_disabled(self):
         try:
             shutil.rmtree("models/" + graphdef_name + "/1")
             time.sleep(5)  # wait for version to unload (but it shouldn't)
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "1"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "2"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "3"))
-                self.assertFalse(
-                    triton_client.is_model_ready(graphdef_name, "7"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "2"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(graphdef_name, "7"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Run inference to make sure the model is still being served even
         # though the version was deleted from the model repository
         try:
-            iu.infer_exact(self,
-                           'graphdef',
-                           tensor_shape,
-                           1,
-                           np.int32,
-                           np.int32,
-                           np.int32,
-                           swap=False,
-                           model_version=1)
+            iu.infer_exact(
+                self,
+                "graphdef",
+                tensor_shape,
+                1,
+                np.int32,
+                np.int32,
+                np.int32,
+                swap=False,
+                model_version=1,
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
     def test_dynamic_model_modify(self):
-        models_base = ('savedmodel', 'plan')
+        models_base = ("savedmodel", "plan")
         models_shape = ((1, 16), (1, 16))
         models = list()
         for m in models_base:
-            models.append(
-                tu.get_model_name(m, np.float32, np.float32, np.float32))
+            models.append(tu.get_model_name(m, np.float32, np.float32, np.float32))
 
         # Make sure savedmodel and plan are in the status
         for model_name in models:
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "1"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -1013,63 +1065,67 @@ def test_dynamic_model_modify(self):
         for version in (1, 3):
             for model_name, model_shape in zip(models_base, models_shape):
                 try:
-                    iu.infer_exact(self,
-                                   model_name,
-                                   model_shape,
-                                   1,
-                                   np.float32,
-                                   np.float32,
-                                   np.float32,
-                                   swap=(version == 3),
-                                   model_version=version)
+                    iu.infer_exact(
+                        self,
+                        model_name,
+                        model_shape,
+                        1,
+                        np.float32,
+                        np.float32,
+                        np.float32,
+                        swap=(version == 3),
+                        model_version=version,
+                    )
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Change the model configuration to use wrong label file
         for base_name, model_name in zip(models_base, models):
-            shutil.copyfile("config.pbtxt.wrong." + base_name,
-                            "models/" + model_name + "/config.pbtxt")
+            shutil.copyfile(
+                "config.pbtxt.wrong." + base_name,
+                "models/" + model_name + "/config.pbtxt",
+            )
 
         time.sleep(5)  # wait for models to reload
         for model_name in models:
             for model_name, model_shape in zip(models_base, models_shape):
                 try:
-                    iu.infer_exact(self,
-                                   model_name,
-                                   model_shape,
-                                   1,
-                                   np.float32,
-                                   np.float32,
-                                   np.float32,
-                                   swap=(version == 3),
-                                   model_version=version,
-                                   output0_raw=False)
+                    iu.infer_exact(
+                        self,
+                        model_name,
+                        model_shape,
+                        1,
+                        np.float32,
+                        np.float32,
+                        np.float32,
+                        swap=(version == 3),
+                        model_version=version,
+                        output0_raw=False,
+                    )
                     self.assertTrue(
-                        False,
-                        "expected error for wrong label for " + model_name)
+                        False, "expected error for wrong label for " + model_name
+                    )
                 except AssertionError as ex:
-                    self.assertTrue("'label9" in str(ex) and "!=" in str(ex),
-                                    str(ex))
+                    self.assertTrue("'label9" in str(ex) and "!=" in str(ex), str(ex))
 
         # Change the model configuration to use the correct label file and to have
         # the default version policy (so that only version 3 is available).
         for base_name, model_name in zip(models_base, models):
-            shutil.copyfile("config.pbtxt." + base_name,
-                            "models/" + model_name + "/config.pbtxt")
+            shutil.copyfile(
+                "config.pbtxt." + base_name, "models/" + model_name + "/config.pbtxt"
+            )
 
         time.sleep(5)  # wait for models to reload
         for model_name in models:
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -1077,56 +1133,58 @@ def test_dynamic_model_modify(self):
         # change in model policy makes that no longer available.
         for model_name, model_shape in zip(models_base, models_shape):
             try:
-                iu.infer_exact(self,
-                               model_name,
-                               model_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               swap=False,
-                               model_version=1)
+                iu.infer_exact(
+                    self,
+                    model_name,
+                    model_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    swap=False,
+                    model_version=1,
+                )
                 self.assertTrue(
-                    False, "expected error for unavailable model " + model_name)
+                    False, "expected error for unavailable model " + model_name
+                )
             except Exception as ex:
                 self.assertIn("Request for unknown model", ex.message())
 
         # Version 3 should continue to work...
         for model_name, model_shape in zip(models_base, models_shape):
             try:
-                iu.infer_exact(self,
-                               model_name,
-                               model_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               swap=True,
-                               model_version=3)
+                iu.infer_exact(
+                    self,
+                    model_name,
+                    model_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    swap=True,
+                    model_version=3,
+                )
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
     def test_dynamic_file_delete(self):
-        models_base = ('savedmodel', 'plan')
+        models_base = ("savedmodel", "plan")
         models_shape = ((1, 16), (1, 16))
         models = list()
         for m in models_base:
-            models.append(
-                tu.get_model_name(m, np.float32, np.float32, np.float32))
+            models.append(tu.get_model_name(m, np.float32, np.float32, np.float32))
 
         # Make sure savedmodel and plan are in the status
         for model_name in models:
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "1"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -1134,15 +1192,17 @@ def test_dynamic_file_delete(self):
         for version in (1, 3):
             for model_name, model_shape in zip(models_base, models_shape):
                 try:
-                    iu.infer_exact(self,
-                                   model_name,
-                                   model_shape,
-                                   1,
-                                   np.float32,
-                                   np.float32,
-                                   np.float32,
-                                   swap=(version == 3),
-                                   model_version=version)
+                    iu.infer_exact(
+                        self,
+                        model_name,
+                        model_shape,
+                        1,
+                        np.float32,
+                        np.float32,
+                        np.float32,
+                        swap=(version == 3),
+                        model_version=version,
+                    )
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -1156,81 +1216,86 @@ def test_dynamic_file_delete(self):
         time.sleep(5)  # wait for models to reload
         for model_name in models:
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Only version 3 (latest) should work...
         for model_name, model_shape in zip(models_base, models_shape):
             try:
-                iu.infer_exact(self,
-                               model_name,
-                               model_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               swap=True,
-                               model_version=3)
+                iu.infer_exact(
+                    self,
+                    model_name,
+                    model_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    swap=True,
+                    model_version=3,
+                )
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
             try:
-                iu.infer_exact(self,
-                               model_name,
-                               model_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               swap=False,
-                               model_version=1)
+                iu.infer_exact(
+                    self,
+                    model_name,
+                    model_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    swap=False,
+                    model_version=1,
+                )
                 self.assertTrue(
-                    False,
-                    "expected error for unavailable model " + graphdef_name)
+                    False, "expected error for unavailable model " + graphdef_name
+                )
             except Exception as ex:
                 self.assertIn("Request for unknown model", ex.message())
 
     def test_multiple_model_repository_polling(self):
         model_shape = (1, 16)
-        savedmodel_name = tu.get_model_name('savedmodel', np.float32,
-                                            np.float32, np.float32)
+        savedmodel_name = tu.get_model_name(
+            "savedmodel", np.float32, np.float32, np.float32
+        )
 
         # Models should be loaded successfully and infer
         # successfully. Initially savedmodel only has version 1.
-        self._infer_success_models([
-            'savedmodel',
-        ], (1,), model_shape)
-        self._infer_success_models(['graphdef', 'onnx'], (1, 3), model_shape)
+        self._infer_success_models(
+            [
+                "savedmodel",
+            ],
+            (1,),
+            model_shape,
+        )
+        self._infer_success_models(["graphdef", "onnx"], (1, 3), model_shape)
 
         # Add the savedmodel to the second model repository, should cause
         # it to be unloaded due to duplication
         shutil.copytree(savedmodel_name, "models_0/" + savedmodel_name)
         time.sleep(5)  # wait for models to reload
         try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models(['graphdef', 'onnx'], (1, 3), model_shape)
+        self._infer_success_models(["graphdef", "onnx"], (1, 3), model_shape)
 
         # Remove the savedmodel from the first model repository; the
         # model from the second model repository should be loaded
@@ -1238,87 +1303,96 @@ def test_multiple_model_repository_polling(self):
         # have versions 1 and 3.
         shutil.rmtree("models/" + savedmodel_name)
         time.sleep(5)  # wait for model to unload
-        self._infer_success_models(['savedmodel', 'graphdef', 'onnx'], (1, 3),
-                                   model_shape)
+        self._infer_success_models(
+            ["savedmodel", "graphdef", "onnx"], (1, 3), model_shape
+        )
 
     def test_multiple_model_repository_control(self):
         # similar to test_multiple_model_repository_polling, but the
         # model load/unload is controlled by the API
         model_shape = (1, 16)
-        savedmodel_name = tu.get_model_name('savedmodel', np.float32,
-                                            np.float32, np.float32)
-        model_bases = ['savedmodel', 'graphdef', 'onnx']
+        savedmodel_name = tu.get_model_name(
+            "savedmodel", np.float32, np.float32, np.float32
+        )
+        model_bases = ["savedmodel", "graphdef", "onnx"]
 
         # Initially models are not loaded
         for base in model_bases:
             try:
-                model_name = tu.get_model_name(base, np.float32, np.float32,
-                                               np.float32)
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                model_name = tu.get_model_name(base, np.float32, np.float32, np.float32)
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Load all models, here we use GRPC
         for base in model_bases:
             try:
-                model_name = tu.get_model_name(base, np.float32, np.float32,
-                                               np.float32)
+                model_name = tu.get_model_name(base, np.float32, np.float32, np.float32)
                 triton_client = grpcclient.InferenceServerClient(
-                    "localhost:8001", verbose=True)
+                    "localhost:8001", verbose=True
+                )
                 triton_client.load_model(model_name)
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Models should be loaded successfully and infer
         # successfully. Initially savedmodel only has version 1.
-        self._infer_success_models([
-            'savedmodel',
-        ], (1,), model_shape)
-        self._infer_success_models(['graphdef', 'onnx'], (1, 3), model_shape)
+        self._infer_success_models(
+            [
+                "savedmodel",
+            ],
+            (1,),
+            model_shape,
+        )
+        self._infer_success_models(["graphdef", "onnx"], (1, 3), model_shape)
 
         # Add the savedmodel to the second model repository. Because the
         # server is not polling, this doesn't change any model state; all
         # models are still loaded and available.
         shutil.copytree(savedmodel_name, "models_0/" + savedmodel_name)
-        self._infer_success_models([
-            'savedmodel',
-        ], (1,), model_shape)
-        self._infer_success_models(['graphdef', 'onnx'], (1, 3), model_shape)
-
-        # Reload savedmodel which will cause it to unload because it
-        # is in 2 model repositories. Use HTTP here.
-        try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+        self._infer_success_models(
+            [
+                "savedmodel",
+            ],
+            (1,),
+            model_shape,
+        )
+        self._infer_success_models(["graphdef", "onnx"], (1, 3), model_shape)
+
+        # Load savedmodel again which should fail because it is now duplicated
+        # in 2 model repositories. Use HTTP here.
+        try:
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(savedmodel_name)
         except Exception as ex:
-            self.assertIn("failed to load '{}'".format(savedmodel_name),
-                          ex.message())
+            self.assertIn("failed to load '{}'".format(savedmodel_name), ex.message())
 
         try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                # Unlike polling mode, the failed load on the duplicate model
+                # should NOT unload the existing versions in model control mode.
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "1"))
+                # Version 3 did not exist in the first model repository, so
+                # it should still not be loaded.
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models(['graphdef', 'onnx'], (1, 3), model_shape)
+        self._infer_success_models(["graphdef", "onnx"], (1, 3), model_shape)
 
         # Remove the savedmodel from the first model repository and
         # explicitly load savedmodel. The savedmodel from the second
@@ -1326,20 +1400,23 @@ def test_multiple_model_repository_control(self):
         # model repository savedmodel should have versions 1 and 3.
         shutil.rmtree("models/" + savedmodel_name)
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
+            # Unload existing in-memory model from first model repository
+            triton_client.unload_model(savedmodel_name)
+            # Load model from second model repository since original was deleted
             triton_client.load_model(savedmodel_name)
         except Exception as ex:
-            self.assertIn("failed to load '{}'".format(savedmodel_name),
-                          ex.message())
+            self.assertIn("failed to load '{}'".format(savedmodel_name), ex.message())
 
-        self._infer_success_models(['savedmodel', 'graphdef', 'onnx'], (1, 3),
-                                   model_shape)
+        self._infer_success_models(
+            ["savedmodel", "graphdef", "onnx"], (1, 3), model_shape
+        )
 
     def test_model_control(self):
         model_shape = (1, 16)
-        onnx_name = tu.get_model_name('onnx', np.float32, np.float32,
-                                      np.float32)
+        onnx_name = tu.get_model_name("onnx", np.float32, np.float32, np.float32)
 
         ensemble_prefix = "simple_"
         ensemble_name = ensemble_prefix + onnx_name
@@ -1347,48 +1424,55 @@ def test_model_control(self):
         # Make sure no models are loaded
         for model_name in (onnx_name, ensemble_name):
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Load non-existent model
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 triton_client.load_model("unknown_model")
                 self.assertTrue(False, "expected unknown model failure")
             except Exception as ex:
                 self.assertIn(
-                    "failed to load 'unknown_model', no version is available",
-                    ex.message())
+                    "failed to load 'unknown_model', failed to poll from model repository",
+                    ex.message(),
+                )
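
Load failures are surfaced to the client as InferenceServerException, and the tests
match on ex.message(). A minimal sketch of that pattern, assuming the HTTP client and a
hypothetical model name:

    import tritonclient.http as httpclient
    from tritonclient.utils import InferenceServerException

    client = httpclient.InferenceServerClient("localhost:8000")
    try:
        client.load_model("unknown_model")  # hypothetical name, not in any repository
    except InferenceServerException as ex:
        # The server explains why the load failed, e.g. that the model could not
        # be polled from the model repository.
        print(ex.message())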
 
         # Load the ensemble model; the dependent model should be polled and loaded
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(ensemble_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_onnx",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_onnx",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
 
         # Delete model configuration for onnx, which will cause
         # the autofiller to use the latest version policy so that only
@@ -1396,51 +1480,65 @@ def test_model_control(self):
         for model_name in (onnx_name,):
             os.remove("models/" + model_name + "/config.pbtxt")
 
-        self._infer_success_models([
-            "onnx",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_onnx",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_onnx",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
 
         # Reload models, only version 3 should be available for onnx
         for model_name in (onnx_name, ensemble_name):
             try:
                 triton_client = grpcclient.InferenceServerClient(
-                    "localhost:8001", verbose=True)
+                    "localhost:8001", verbose=True
+                )
                 triton_client.load_model(model_name)
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (3,), model_shape)
-        self._infer_success_models([
-            "simple_onnx",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (3,),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_onnx",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
 
         for model_name in (onnx_name,):
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Unload a non-existent model, nothing should happen
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 triton_client.unload_model("unknown_model")
             except Exception as ex:
@@ -1449,24 +1547,23 @@ def test_model_control(self):
         # Unload the dependent model; as a side effect, the ensemble model will
         # be forced to be unloaded
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.unload_model(onnx_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         for model_name in (onnx_name, ensemble_name):
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -1474,41 +1571,43 @@ def test_model_control(self):
         # model. The ensemble model should not be reloaded because it
         # was explicitly unloaded.
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.unload_model(ensemble_name)
             triton_client.load_model(onnx_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (3,), model_shape)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (3,),
+            model_shape,
+        )
 
         try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(ensemble_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(ensemble_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(ensemble_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(ensemble_name, "3"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
     def test_model_control_fail(self):
-        model_name = tu.get_model_name('onnx', np.float32, np.float32,
-                                       np.float32)
+        model_name = tu.get_model_name("onnx", np.float32, np.float32, np.float32)
 
         # Make sure no models are loaded
         try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
                 self.assertFalse(triton_client.is_model_ready(model_name, "1"))
@@ -1518,28 +1617,27 @@ def test_model_control_fail(self):
 
         # Request to load the model and expect fail to load
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(model_name)
             self.assertTrue(False, "expecting load failure")
         except InferenceServerException as ex:
-            self.assertIn("load failed for model '{}'".format(model_name),
-                          ex.message())
+            self.assertIn("load failed for model '{}'".format(model_name), ex.message())
 
         # Another attempt should fail as well
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(model_name)
             self.assertTrue(False, "expecting load failure")
         except InferenceServerException as ex:
-            self.assertIn("load failed for model '{}'".format(model_name),
-                          ex.message())
+            self.assertIn("load failed for model '{}'".format(model_name), ex.message())
 
     def test_model_control_ensemble(self):
         model_shape = (1, 16)
-        onnx_name = tu.get_model_name('onnx', np.float32, np.float32,
-                                      np.float32)
+        onnx_name = tu.get_model_name("onnx", np.float32, np.float32, np.float32)
 
         ensemble_prefix = "simple_"
         ensemble_name = ensemble_prefix + onnx_name
@@ -1547,83 +1645,91 @@ def test_model_control_ensemble(self):
         # Make sure no models are loaded
         for model_name in (onnx_name, ensemble_name):
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Load the ensemble model; the dependent model should be polled and loaded
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(ensemble_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_onnx",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_onnx",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
 
         # Unload the ensemble with the unload_dependents flag. All models should be unloaded
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.unload_model(ensemble_name, unload_dependents=True)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
         for model_name in (onnx_name, ensemble_name):
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
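
The unload_dependents flag controls whether the composing models of an ensemble are
unloaded together with it, which is what the assertions above and below verify. A short
sketch of both calls, assuming explicit model control is enabled and a hypothetical
ensemble name:

    import tritonclient.http as httpclient

    ensemble_name = "simple_onnx_float32_float32_float32"  # hypothetical ensemble name

    client = httpclient.InferenceServerClient("localhost:8000")

    client.load_model(ensemble_name)  # composing models are loaded as dependencies
    client.unload_model(ensemble_name, unload_dependents=True)  # unloads them as well

    client.load_model(ensemble_name)
    client.unload_model(ensemble_name)  # default: composing models remain loaded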
 
         # Load the ensemble model, and unload it without the unload_dependents flag
         # (the default). The dependent model should still be available.
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(ensemble_name)
             triton_client.unload_model(ensemble_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (1, 3), model_shape)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (1, 3),
+            model_shape,
+        )
 
         try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(ensemble_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(ensemble_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(ensemble_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(ensemble_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
         except Exception as ex:
@@ -1631,8 +1737,7 @@ def test_model_control_ensemble(self):
 
     def test_load_same_model_different_platform(self):
         model_shape = (1, 16)
-        model_name = tu.get_model_name('simple', np.float32, np.float32,
-                                       np.float32)
+        model_name = tu.get_model_name("simple", np.float32, np.float32, np.float32)
 
         # Check whether or not to use grpc protocol
         use_grpc = "TRITONSERVER_USE_GRPC" in os.environ
@@ -1646,19 +1751,22 @@ def test_load_same_model_different_platform(self):
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
             self.assertTrue(triton_client.is_model_ready(model_name, "3"))
             if use_grpc:
-                metadata = triton_client.get_model_metadata(model_name,
-                                                            as_json=True)
+                metadata = triton_client.get_model_metadata(model_name, as_json=True)
             else:
                 metadata = triton_client.get_model_metadata(model_name)
             self.assertEqual(metadata["platform"], "tensorrt_plan")
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
-        self._infer_success_models([
-            "simple",
-        ], (
-            1,
-            3,
-        ), model_shape)
+        self._infer_success_models(
+            [
+                "simple",
+            ],
+            (
+                1,
+                3,
+            ),
+            model_shape,
+        )
 
         # Copy the same model of different platform to model repository
         shutil.rmtree("models/" + model_name)
@@ -1680,19 +1788,22 @@ def test_load_same_model_different_platform(self):
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
             self.assertTrue(triton_client.is_model_ready(model_name, "3"))
             if use_grpc:
-                metadata = triton_client.get_model_metadata(model_name,
-                                                            as_json=True)
+                metadata = triton_client.get_model_metadata(model_name, as_json=True)
             else:
                 metadata = triton_client.get_model_metadata(model_name)
             self.assertEqual(metadata["platform"], "pytorch_libtorch")
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
-        self._infer_success_models([
-            "simple",
-        ], (
-            1,
-            3,
-        ), model_shape)
+        self._infer_success_models(
+            [
+                "simple",
+            ],
+            (
+                1,
+                3,
+            ),
+            model_shape,
+        )
 
     def test_model_availability_on_reload(self):
         model_name = "identity_zero_1_int32"
@@ -1717,9 +1828,8 @@ def test_model_availability_on_reload(self):
 
         # Reload models, v1 should still be available until v2 is loaded
         # The load is requested in another thread as it is a blocking API,
-        # and the v1 availibility should be tested during the reload
-        thread = threading.Thread(target=self._async_load,
-                                  args=(model_name, use_grpc))
+        # and the v1 availability should be tested during the reload
+        thread = threading.Thread(target=self._async_load, args=(model_name, use_grpc))
         thread.start()
         # wait for time < model creation delay to ensure load request is sent
         time.sleep(3)
@@ -1730,9 +1840,12 @@ def test_model_availability_on_reload(self):
             triton_client = self._get_client(use_grpc)
             self.assertTrue(triton_client.is_server_live())
             load_end = time.time()
-            self.assertTrue((load_end - load_start) < 5,
-                            "server was waiting unexpectly, waited {}".format(
-                                (load_end - load_start)))
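+            # The liveness check above was issued while the load is still in
+            # progress; it should return quickly rather than wait for the load,
+            # hence the 5 second bound below.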
+            self.assertTrue(
+                (load_end - load_start) < 5,
+                "server was waiting unexpectedly, waited {}".format(
+                    (load_end - load_start)
+                ),
+            )
             self.assertTrue(triton_client.is_server_ready())
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
         except Exception as ex:
@@ -1770,14 +1883,12 @@ def test_model_availability_on_reload_2(self):
         self._infer_success_identity(model_base, (1,), np.int32, model_shape)
 
         # Overwrite config.pbtxt to load v2 only
-        shutil.copyfile("config.pbtxt.v2",
-                        "models/" + model_name + "/config.pbtxt")
+        shutil.copyfile("config.pbtxt.v2", "models/" + model_name + "/config.pbtxt")
 
         # Reload models, v1 should still be available until v2 is loaded
         # The load is requested in another thread as it is a blocking API,
-        # and the v1 availibility should be tested during the reload
-        thread = threading.Thread(target=self._async_load,
-                                  args=(model_name, use_grpc))
+        # and the v1 availability should be tested during the reload
+        thread = threading.Thread(target=self._async_load, args=(model_name, use_grpc))
         thread.start()
         # wait for time < model creation delay to ensure load request is sent
         time.sleep(3)
@@ -1788,9 +1899,12 @@ def test_model_availability_on_reload_2(self):
             triton_client = self._get_client(use_grpc)
             self.assertTrue(triton_client.is_server_live())
             load_end = time.time()
-            self.assertTrue((load_end - load_start) < 5,
-                            "server was waiting unexpectly, waited {}".format(
-                                (load_end - load_start)))
+            self.assertTrue(
+                (load_end - load_start) < 5,
+                "server was waiting unexpectedly, waited {}".format(
+                    (load_end - load_start)
+                ),
+            )
             self.assertTrue(triton_client.is_server_ready())
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
         except Exception as ex:
@@ -1828,13 +1942,11 @@ def test_model_availability_on_reload_3(self):
         self._infer_success_identity(model_base, (1,), np.int32, model_shape)
 
         # Overwrite config.pbtxt to load v2 only
-        shutil.copyfile("config.pbtxt.new",
-                        "models/" + model_name + "/config.pbtxt")
+        shutil.copyfile("config.pbtxt.new", "models/" + model_name + "/config.pbtxt")
 
         # Reload models, v1 will be reloaded but it should be available
         # during the whole reload
-        thread = threading.Thread(target=self._async_load,
-                                  args=(model_name, use_grpc))
+        thread = threading.Thread(target=self._async_load, args=(model_name, use_grpc))
         thread.start()
         # wait for time < model creation delay to ensure load request is sent
         time.sleep(3)
@@ -1845,9 +1957,12 @@ def test_model_availability_on_reload_3(self):
             triton_client = self._get_client(use_grpc)
             self.assertTrue(triton_client.is_server_live())
             load_end = time.time()
-            self.assertTrue((load_end - load_start) < 5,
-                            "server was waiting unexpectly, waited {}".format(
-                                (load_end - load_start)))
+            self.assertTrue(
+                (load_end - load_start) < 5,
+                "server was waiting unexpectedly, waited {}".format(
+                    (load_end - load_start)
+                ),
+            )
             self.assertTrue(triton_client.is_server_ready())
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
         except Exception as ex:
@@ -1872,8 +1987,9 @@ def test_model_reload_fail(self):
 
         # Make sure version 1 of the model is loaded
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertTrue(triton_client.is_server_ready())
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
@@ -1882,23 +1998,26 @@ def test_model_reload_fail(self):
         self._infer_success_identity(model_base, (1,), np.int32, model_shape)
 
         # Overwrite config.pbtxt to load v2 only on GPU, which will fail
-        shutil.copyfile("config.pbtxt.v2.gpu",
-                        "models/" + model_name + "/config.pbtxt")
+        shutil.copyfile("config.pbtxt.v2.gpu", "models/" + model_name + "/config.pbtxt")
 
         # Reload models, v1 should still be available even if v2 fails to load
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(model_name)
             self.assertTrue(False, "expecting load failure")
         except Exception as ex:
-            self.assertIn("version 2: Internal: GPU instances not supported",
-                          ex.message())
+            self.assertIn(
+                "version 2 is at UNAVAILABLE state: Internal: GPU instances not supported",
+                ex.message(),
+            )
 
         # Make sure version 1 of the model is available, and version 2 is not
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertTrue(triton_client.is_server_ready())
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
@@ -1909,113 +2028,143 @@ def test_model_reload_fail(self):
 
     def test_multiple_model_repository_control_startup_models(self):
         model_shape = (1, 16)
-        onnx_name = tu.get_model_name('onnx', np.float32, np.float32,
-                                      np.float32)
-        plan_name = tu.get_model_name('plan', np.float32, np.float32,
-                                      np.float32)
+        onnx_name = tu.get_model_name("onnx", np.float32, np.float32, np.float32)
+        plan_name = tu.get_model_name("plan", np.float32, np.float32, np.float32)
 
         ensemble_prefix = "simple_"
         onnx_ensemble_name = ensemble_prefix + onnx_name
         plan_ensemble_name = ensemble_prefix + plan_name
 
         # Make sure unloaded models are not in the status
-        for base in ('savedmodel',):
-            model_name = tu.get_model_name(base, np.float32, np.float32,
-                                           np.float32)
+        for base in ("savedmodel",):
+            model_name = tu.get_model_name(base, np.float32, np.float32, np.float32)
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # And loaded models work properly
-        self._infer_success_models([
-            "onnx",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_onnx",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
-        self._infer_success_models([
-            "plan",
-        ], (1, 3), model_shape)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_onnx",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
+        self._infer_success_models(
+            [
+                "plan",
+            ],
+            (1, 3),
+            model_shape,
+        )
 
         # Load non-existing model
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 triton_client.load_model("unknown_model")
                 self.assertTrue(False, "expected unknown model failure")
             except Exception as ex:
                 self.assertIn(
-                    "failed to load 'unknown_model', no version is available",
-                    ex.message())
+                    "failed to load 'unknown_model', failed to poll from model repository",
+                    ex.message(),
+                )
 
         # Load plan ensemble model, the dependent model is already
         # loaded via command-line
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(plan_ensemble_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "plan",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_plan",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
+        self._infer_success_models(
+            [
+                "plan",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_plan",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
 
         # Delete model configuration, which will cause the autofiller
         # to use the latest version policy so that only version 3 will
         # be available if the models are re-loaded
         os.remove("models/" + onnx_name + "/config.pbtxt")
 
-        self._infer_success_models([
-            "plan",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_plan",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
+        self._infer_success_models(
+            [
+                "plan",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_plan",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
 
         # Reload onnx, only version 3 should be available
         try:
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             triton_client.load_model(onnx_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (3,), model_shape)
-        self._infer_success_models([
-            "simple_onnx",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
-
-        try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (3,),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_onnx",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
+
+        try:
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
                 self.assertFalse(triton_client.is_model_ready(onnx_name, "1"))
@@ -2023,10 +2172,10 @@ def test_multiple_model_repository_control_startup_models(self):
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Unload non-existing model, nothing should happen
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 triton_client.unload_model("unknown_model")
             except Exception as ex:
@@ -2035,24 +2184,23 @@ def test_multiple_model_repository_control_startup_models(self):
         # Unload the onnx model; as a side effect, the ensemble model
         # will be forced to be unloaded
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.unload_model(onnx_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         for model_name in [onnx_name, onnx_ensemble_name]:
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -2060,36 +2208,46 @@ def test_multiple_model_repository_control_startup_models(self):
         # depending model. The ensemble model should not be reloaded
         # because it was explicitly unloaded.
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.unload_model(onnx_ensemble_name)
             triton_client.load_model(onnx_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (3,), model_shape)
-        self._infer_success_models([
-            "plan",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_plan",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
-
-        try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (3,),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "plan",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_plan",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
+
+        try:
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(onnx_ensemble_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(onnx_ensemble_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(onnx_ensemble_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(onnx_ensemble_name, "3"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -2097,7 +2255,7 @@ def test_model_repository_index(self):
         # use model control EXPLICIT and --load-model to load a subset of models
         # in model repository
         tensor_shape = (1, 16)
-        model_bases = ['graphdef', 'savedmodel', "simple_savedmodel"]
+        model_bases = ["graphdef", "savedmodel", "simple_savedmodel"]
 
         # Sanity check on loaded models
         # 3 models should be loaded:
@@ -2106,12 +2264,13 @@ def test_model_repository_index(self):
         #     graphdef_float32_float32_float32
         for model_base in model_bases:
             try:
-                model_name = tu.get_model_name(model_base, np.float32,
-                                               np.float32, np.float32)
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                model_name = tu.get_model_name(
+                    model_base, np.float32, np.float32, np.float32
+                )
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
                     self.assertTrue(triton_client.is_model_ready(model_name))
@@ -2123,8 +2282,9 @@ def test_model_repository_index(self):
         # which appears in two repositories.
         model_bases.append("simple_graphdef")
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
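+            # The repository index reports every model in the configured
+            # repositories, loaded or not, which is why 8 entries are expected
+            # even though only a subset was loaded at startup.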
             index = triton_client.get_model_repository_index()
             indexed = list()
             self.assertEqual(len(index), 8)
@@ -2133,15 +2293,17 @@ def test_model_repository_index(self):
                 if i["name"] == "onnx_float32_float32_float32":
                     self.assertEqual(i["state"], "UNAVAILABLE")
                     self.assertEqual(
-                        i["reason"],
-                        "model appears in two or more repositories")
+                        i["reason"], "model appears in two or more repositories"
+                    )
             for model_base in model_bases:
-                model_name = tu.get_model_name(model_base, np.float32,
-                                               np.float32, np.float32)
+                model_name = tu.get_model_name(
+                    model_base, np.float32, np.float32, np.float32
+                )
                 self.assertTrue(model_name in indexed)
 
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             index = triton_client.get_model_repository_index()
             indexed = list()
             self.assertEqual(len(index.models), 8)
@@ -2150,10 +2312,12 @@ def test_model_repository_index(self):
                 if i.name == "onnx_float32_float32_float32":
                     self.assertEqual(i.state, "UNAVAILABLE")
                     self.assertEqual(
-                        i.reason, "model appears in two or more repositories")
+                        i.reason, "model appears in two or more repositories"
+                    )
             for model_base in model_bases:
-                model_name = tu.get_model_name(model_base, np.float32,
-                                               np.float32, np.float32)
+                model_name = tu.get_model_name(
+                    model_base, np.float32, np.float32, np.float32
+                )
                 self.assertTrue(model_name in indexed)
 
         except Exception as ex:
@@ -2162,21 +2326,19 @@ def test_model_repository_index(self):
     def test_config_override(self):
         model_shape = (1, 16)
 
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
-            for base in (('onnx', 'onnxruntime'),):
-                model_name = tu.get_model_name(base[0], np.float32, np.float32,
-                                               np.float32)
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
+            for base in (("onnx", "onnxruntime"),):
+                model_name = tu.get_model_name(
+                    base[0], np.float32, np.float32, np.float32
+                )
                 try:
                     self.assertTrue(triton_client.is_server_live())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "2"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "2"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -2185,18 +2347,23 @@ def test_config_override(self):
                 try:
                     triton_client.load_model(model_name)
                     self.assertTrue(
-                        False, "expected fail to load '{}'".format(model_name))
+                        False, "expected fail to load '{}'".format(model_name)
+                    )
                 except Exception as ex:
                     self.assertIn(
-                        "load failed for model '{}'".format(model_name),
-                        ex.message())
+                        "load failed for model '{}'".format(model_name), ex.message()
+                    )
 
                 # Request to load the model with provided "correct" config
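+                # The config below pins version_policy to version 2 only, so only
+                # version 2 is expected to be ready after the load succeeds.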
                 try:
-                    triton_client.load_model(model_name,
-                                             config="""
+                    triton_client.load_model(
+                        model_name,
+                        config="""
 {{"backend":"{backend}","version_policy":{{"specific" : {{ "versions": [2] }} }} }}
-""".format(backend=base[1]))
+""".format(
+                            backend=base[1]
+                        ),
+                    )
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
                 self.assertFalse(triton_client.is_model_ready(model_name, "1"))
@@ -2204,68 +2371,61 @@ def test_config_override(self):
                 self.assertFalse(triton_client.is_model_ready(model_name, "3"))
 
                 # And loaded models work properly
-                self._infer_success_models([
-                    base[0],
-                ], (2,), model_shape)
+                self._infer_success_models(
+                    [
+                        base[0],
+                    ],
+                    (2,),
+                    model_shape,
+                )
 
                 # A request without additional config will load with the default
                 # config and is expected to fail; version 2 will not be unloaded.
                 try:
                     triton_client.load_model(model_name)
                     self.assertTrue(
-                        False, "expected fail to load '{}'".format(model_name))
+                        False, "expected fail to load '{}'".format(model_name)
+                    )
                 except Exception as ex:
                     self.assertIn(
-                        "load failed for model '{}'".format(model_name),
-                        ex.message())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "2"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                        "load failed for model '{}'".format(model_name), ex.message()
+                    )
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "2"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
 
                 # Unload model for the next client iteration
                 try:
                     triton_client.unload_model(model_name)
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "2"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "2"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
     def test_file_override(self):
-        import base64
-
         model_shape = (1, 16)
         override_base = "override_model"
 
-        for base in (('onnx', 'onnxruntime'),):
-            model_name = tu.get_model_name(base[0], np.float32, np.float32,
-                                           np.float32)
-            override_model_name = tu.get_model_name(override_base, np.float32,
-                                                    np.float32, np.float32)
+        for base in (("onnx", "onnxruntime"),):
+            model_name = tu.get_model_name(base[0], np.float32, np.float32, np.float32)
+            override_model_name = tu.get_model_name(
+                override_base, np.float32, np.float32, np.float32
+            )
 
             # Prepare override file
-            with open("models/{}/3/model.{}".format(model_name, base[0]),
-                      'rb') as f:
+            with open("models/{}/3/model.{}".format(model_name, base[0]), "rb") as f:
                 file_content = f.read()
 
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 try:
                     self.assertTrue(triton_client.is_server_live())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "2"))
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "2"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "3"))
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -2275,14 +2435,17 @@ def test_file_override(self):
                 # not be used.
                 try:
                     triton_client.load_model(
-                        model_name, files={"file:1/model.onnx": file_content})
-                    self.assertTrue(
-                        False, "expected error on missing override config")
+                        model_name, files={"file:1/model.onnx": file_content}
+                    )
+                    self.assertTrue(False, "expected error on missing override config")
                 except InferenceServerException as ex:
                     # [FIXME] Improve error reporting to mention missing config
                     self.assertIn(
-                        "failed to load '{}', failed to poll from model repository"
-                        .format(model_name), ex.message())
+                        "failed to load '{}', failed to poll from model repository".format(
+                            model_name
+                        ),
+                        ex.message(),
+                    )
 
                 # Sanity check that the previously loaded version is still available
                 # after the failed attempt to load the model with a different version
@@ -2290,18 +2453,22 @@ def test_file_override(self):
                 self.assertFalse(triton_client.is_model_ready(model_name, "2"))
                 self.assertTrue(triton_client.is_model_ready(model_name, "3"))
 
-                self._infer_success_models([
-                    base[0],
-                ], (3,), model_shape)
+                self._infer_success_models(
+                    [
+                        base[0],
+                    ],
+                    (3,),
+                    model_shape,
+                )
 
                 # Request to load the model with override file and config in
                 # a different name
                 try:
                     triton_client.load_model(
                         override_model_name,
-                        config="""{{"backend":"{backend}" }}""".format(
-                            backend=base[1]),
-                        files={"file:1/model.onnx": file_content})
+                        config="""{{"backend":"{backend}" }}""".format(backend=base[1]),
+                        files={"file:1/model.onnx": file_content},
+                    )
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -2310,31 +2477,35 @@ def test_file_override(self):
                 self.assertFalse(triton_client.is_model_ready(model_name, "1"))
                 self.assertFalse(triton_client.is_model_ready(model_name, "2"))
                 self.assertTrue(triton_client.is_model_ready(model_name, "3"))
-                self._infer_success_models([
-                    base[0],
-                ], (3,), model_shape)
+                self._infer_success_models(
+                    [
+                        base[0],
+                    ],
+                    (3,),
+                    model_shape,
+                )
 
                 # New override model should also be available
-                self.assertTrue(
-                    triton_client.is_model_ready(override_model_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(override_model_name, "2"))
-                self.assertFalse(
-                    triton_client.is_model_ready(override_model_name, "3"))
-                self._infer_success_models([
-                    override_base,
-                ], (1,),
-                                           model_shape,
-                                           swap=True)
+                self.assertTrue(triton_client.is_model_ready(override_model_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(override_model_name, "2"))
+                self.assertFalse(triton_client.is_model_ready(override_model_name, "3"))
+                self._infer_success_models(
+                    [
+                        override_base,
+                    ],
+                    (1,),
+                    model_shape,
+                    swap=True,
+                )
 
                 # Request to load the model with override file and config in
                 # original name
                 try:
                     triton_client.load_model(
                         model_name,
-                        config="""{{"backend":"{backend}" }}""".format(
-                            backend=base[1]),
-                        files={"file:1/model.onnx": file_content})
+                        config="""{{"backend":"{backend}" }}""".format(backend=base[1]),
+                        files={"file:1/model.onnx": file_content},
+                    )
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -2343,24 +2514,27 @@ def test_file_override(self):
                 self.assertTrue(triton_client.is_model_ready(model_name, "1"))
                 self.assertFalse(triton_client.is_model_ready(model_name, "2"))
                 self.assertFalse(triton_client.is_model_ready(model_name, "3"))
-                self._infer_success_models([
-                    base[0],
-                ], (1,),
-                                           model_shape,
-                                           swap=True)
+                self._infer_success_models(
+                    [
+                        base[0],
+                    ],
+                    (1,),
+                    model_shape,
+                    swap=True,
+                )
 
                 # The model with different name should be available
-                self.assertTrue(
-                    triton_client.is_model_ready(override_model_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(override_model_name, "2"))
-                self.assertFalse(
-                    triton_client.is_model_ready(override_model_name, "3"))
-                self._infer_success_models([
-                    override_base,
-                ], (1,),
-                                           model_shape,
-                                           swap=True)
+                self.assertTrue(triton_client.is_model_ready(override_model_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(override_model_name, "2"))
+                self.assertFalse(triton_client.is_model_ready(override_model_name, "3"))
+                self._infer_success_models(
+                    [
+                        override_base,
+                    ],
+                    (1,),
+                    model_shape,
+                    swap=True,
+                )
 
                 # Reset model for the next client iteration
                 try:
@@ -2373,19 +2547,99 @@ def test_file_override(self):
                 self.assertFalse(triton_client.is_model_ready(model_name, "1"))
                 self.assertFalse(triton_client.is_model_ready(model_name, "2"))
                 self.assertTrue(triton_client.is_model_ready(model_name, "3"))
-                self._infer_success_models([
-                    base[0],
-                ], (3,), model_shape)
+                self._infer_success_models(
+                    [
+                        base[0],
+                    ],
+                    (3,),
+                    model_shape,
+                )
+
+    # Test that model load API file override can't be used to create files
+    # outside of any model directory.
+    def test_file_override_security(self):
+        # When using model load API, temporary model directories are created in
+        # a randomly generated /tmp/folderXXXXXX directory for the life of the
+        # model, and cleaned up on model unload.
+        model_basepath = "/tmp/folderXXXXXX"
+        if os.path.exists(model_basepath) and os.path.isdir(model_basepath):
+            shutil.rmtree(model_basepath)
+        os.makedirs(model_basepath)
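+        # NOTE: the fixed "/tmp/folderXXXXXX" path above is only a local stand-in
+        # for the randomly named per-model directory the server creates; it is
+        # used here to construct and verify the escape paths checked below.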
+
+        # Set file override paths that try to escape out of model directory,
+        # and test both pre-existing and non-existent files.
+        root_home_dir = "/root"
+
+        # Relative paths
+        escape_dir_rel = os.path.join("..", "..", "root")
+        escape_dir_full = os.path.join(model_basepath, escape_dir_rel)
+        self.assertEqual(os.path.abspath(escape_dir_full), root_home_dir)
+
+        new_file_rel = os.path.join(escape_dir_rel, "new_dir", "test.txt")
+        self.assertFalse(os.path.exists(os.path.join(model_basepath, new_file_rel)))
+        existing_file_rel = os.path.join(escape_dir_rel, ".bashrc")
+        self.assertTrue(os.path.exists(os.path.join(model_basepath, existing_file_rel)))
+
+        # Symlinks
+        ## No easy way to inject a symlink into the generated temp model dir, so
+        ## for testing's sake, make a fixed symlink path in /tmp.
+        escape_dir_symlink_rel = os.path.join("..", "escape_symlink")
+        escape_dir_symlink_full = "/tmp/escape_symlink"
+        self.assertEqual(
+            os.path.abspath(os.path.join(model_basepath, escape_dir_symlink_rel)),
+            escape_dir_symlink_full,
+        )
+        if os.path.exists(escape_dir_symlink_full):
+            os.unlink(escape_dir_symlink_full)
+        os.symlink(root_home_dir, escape_dir_symlink_full)
+        # The symlink should resolve to the root home directory
+        self.assertEqual(os.path.realpath(escape_dir_symlink_full), root_home_dir)
+
+        symlink_new_file_rel = os.path.join(
+            escape_dir_symlink_rel, "new_dir", "test.txt"
+        )
+        self.assertFalse(
+            os.path.exists(os.path.join(model_basepath, symlink_new_file_rel))
+        )
+        symlink_existing_file_rel = os.path.join(escape_dir_symlink_rel, ".bashrc")
+        self.assertTrue(
+            os.path.exists(os.path.join(model_basepath, symlink_existing_file_rel))
+        )
+
+        # Contents to try writing to the files; the writes are expected to fail
+        new_contents = "This shouldn't exist"
+        new_contents_b64 = base64.b64encode(new_contents.encode())
+
+        new_files = [new_file_rel, symlink_new_file_rel]
+        existing_files = [existing_file_rel, symlink_existing_file_rel]
+        all_files = new_files + existing_files
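+        # Every load below is expected to be rejected: the server should refuse
+        # any file override whose path resolves outside the model directory.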
+        for filepath in all_files:
+            # minimal config to create a new model
+            config = json.dumps({"backend": "identity"})
+            files = {f"file:{filepath}": new_contents_b64}
+            with httpclient.InferenceServerClient("localhost:8000") as client:
+                with self.assertRaisesRegex(InferenceServerException, "failed to load"):
+                    client.load_model("new_model", config=config, files=files)
+
+        for rel_path in new_files:
+            # Assert new file wasn't created
+            self.assertFalse(os.path.exists(os.path.join(model_basepath, rel_path)))
+
+        for rel_path in existing_files:
+            # Read the existing file and make sure its contents weren't overwritten
+            existing_file = os.path.join(model_basepath, rel_path)
+            self.assertTrue(os.path.exists(existing_file))
+            with open(existing_file) as f:
+                contents = f.read()
+                self.assertNotEqual(contents, new_contents)
 
     def test_shutdown_dynamic(self):
         model_shape = (1, 1)
         input_data = np.ones(shape=(1, 1), dtype=np.float32)
 
-        inputs = [grpcclient.InferInput('INPUT0', model_shape, "FP32")]
+        inputs = [grpcclient.InferInput("INPUT0", model_shape, "FP32")]
         inputs[0].set_data_from_numpy(input_data)
 
-        triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                         verbose=True)
+        triton_client = grpcclient.InferenceServerClient("localhost:8001", verbose=True)
         model_name = "custom_zero_1_float32"
 
         # Send two requests as only requests held in scheduler are counted
@@ -2403,26 +2657,27 @@ def callback(user_data, result, error):
         request_count = 6
         async_results = []
         for _ in range(request_count):
-            triton_client.async_infer(model_name, inputs,
-                                      partial(callback, async_results))
+            triton_client.async_infer(
+                model_name, inputs, partial(callback, async_results)
+            )
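+        # Requests already queued in the scheduler should still complete after the
+        # shutdown signal below; only newly submitted requests are rejected.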
         time.sleep(1)
 
         # Send signal to shutdown the server
-        os.kill(int(os.environ['SERVER_PID']), signal.SIGINT)
+        os.kill(int(os.environ["SERVER_PID"]), signal.SIGINT)
 
         # Send more requests, which should be rejected
         try:
             triton_client.infer(model_name, inputs)
-            self.assertTrue(False,
-                            "expected error for new inference during shutdown")
+            self.assertTrue(False, "expected error for new inference during shutdown")
         except InferenceServerException as ex:
             self.assertIn(
                 "Server is stopping, scheduler for model has stopped accepting new inference requests",
-                ex.message())
+                ex.message(),
+            )
 
         # Wait until the results are available in user_data
         time_out = 30
-        while ((len(async_results) < request_count) and time_out > 0):
+        while (len(async_results) < request_count) and time_out > 0:
             time_out = time_out - 1
             time.sleep(1)
 
@@ -2430,21 +2685,19 @@ def callback(user_data, result, error):
         for result in async_results:
             if type(result) == InferenceServerException:
                 raise result
-            output_data = result.as_numpy('OUTPUT0')
+            output_data = result.as_numpy("OUTPUT0")
             np.testing.assert_allclose(
-                output_data,
-                input_data,
-                err_msg='Inference result is not correct')
+                output_data, input_data, err_msg="Inference result is not correct"
+            )
 
     def test_shutdown_sequence(self):
         model_shape = (1, 1)
         input_data = np.ones(shape=(1, 1), dtype=np.int32)
 
-        inputs = [grpcclient.InferInput('INPUT', model_shape, "INT32")]
+        inputs = [grpcclient.InferInput("INPUT", model_shape, "INT32")]
         inputs[0].set_data_from_numpy(input_data)
 
-        triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                         verbose=True)
+        triton_client = grpcclient.InferenceServerClient("localhost:8001", verbose=True)
         model_name = "custom_sequence_int32"
 
         # Send two requests as only requests held in scheduler are counted
@@ -2459,59 +2712,57 @@ def callback(user_data, result, error):
         request_count = 2
         async_results = []
         for i in range(request_count):
-            triton_client.async_infer(model_name,
-                                      inputs,
-                                      partial(callback, async_results),
-                                      sequence_id=(i + 1),
-                                      sequence_start=True)
+            triton_client.async_infer(
+                model_name,
+                inputs,
+                partial(callback, async_results),
+                sequence_id=(i + 1),
+                sequence_start=True,
+            )
         time.sleep(1)
 
         # Send signal to shutdown the server
-        os.kill(int(os.environ['SERVER_PID']), signal.SIGINT)
+        os.kill(int(os.environ["SERVER_PID"]), signal.SIGINT)
 
         # Send requests with different characteristics
-        # 1: New sequence with new seqeuence ID
-        try:
-            triton_client.infer(model_name,
-                                inputs,
-                                sequence_id=request_count,
-                                sequence_start=True)
-            self.assertTrue(False,
-                            "expected error for new inference during shutdown")
+        # 1: New sequence with new sequence ID
+        try:
+            triton_client.infer(
+                model_name, inputs, sequence_id=request_count, sequence_start=True
+            )
+            self.assertTrue(False, "expected error for new inference during shutdown")
         except InferenceServerException as ex:
             self.assertIn(
                 "Server is stopping, scheduler for model has stopped accepting new inference requests",
-                ex.message())
-        # 2: New sequence with existing seqeuence ID
-        try:
-            triton_client.infer(model_name,
-                                inputs,
-                                sequence_id=1,
-                                sequence_start=True)
-            self.assertTrue(False,
-                            "expected error for new inference during shutdown")
+                ex.message(),
+            )
+        # 2: New sequence with existing sequence ID
+        try:
+            triton_client.infer(model_name, inputs, sequence_id=1, sequence_start=True)
+            self.assertTrue(False, "expected error for new inference during shutdown")
         except InferenceServerException as ex:
             self.assertIn(
                 "Server is stopping, scheduler for model has stopped accepting new inference requests",
-                ex.message())
+                ex.message(),
+            )
         # 3: Continuing sequence
         try:
-            res = triton_client.infer(model_name,
-                                      inputs,
-                                      sequence_id=2,
-                                      sequence_end=True)
-            output_data = res.as_numpy('OUTPUT')
+            res = triton_client.infer(
+                model_name, inputs, sequence_id=2, sequence_end=True
+            )
+            output_data = res.as_numpy("OUTPUT")
             # Results are accumulated
             np.testing.assert_allclose(
                 output_data,
                 input_data + input_data,
-                err_msg='Inference result is not correct')
+                err_msg="Inference result is not correct",
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Wait until the results are available in user_data
         time_out = 30
-        while ((len(async_results) < request_count) and time_out > 0):
+        while (len(async_results) < request_count) and time_out > 0:
             time_out = time_out - 1
             time.sleep(1)
 
@@ -2519,11 +2770,10 @@ def callback(user_data, result, error):
         for result in async_results:
             if type(result) == InferenceServerException:
                 raise result
-            output_data = result.as_numpy('OUTPUT')
+            output_data = result.as_numpy("OUTPUT")
             np.testing.assert_allclose(
-                output_data,
-                input_data,
-                err_msg='Inference result is not correct')
+                output_data, input_data, err_msg="Inference result is not correct"
+            )
 
         # Sleep 5 seconds for the scheduler timeout to take effect, which should
         # reduce the in-flight count
@@ -2533,11 +2783,10 @@ def test_shutdown_ensemble(self):
         model_shape = (1, 1)
         input_data = np.ones(shape=(1, 1), dtype=np.float32)
 
-        inputs = [grpcclient.InferInput('INPUT0', model_shape, "FP32")]
+        inputs = [grpcclient.InferInput("INPUT0", model_shape, "FP32")]
         inputs[0].set_data_from_numpy(input_data)
 
-        triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                         verbose=True)
+        triton_client = grpcclient.InferenceServerClient("localhost:8001", verbose=True)
         model_name = "ensemble_zero_1_float32"
 
         # Send two requests as only requests held in scheduler are counted
@@ -2554,26 +2803,28 @@ def callback(user_data, result, error):
         request_count = 1
         async_results = []
         for _ in range(request_count):
-            triton_client.async_infer(model_name, inputs,
-                                      partial(callback, async_results))
+            triton_client.async_infer(
+                model_name, inputs, partial(callback, async_results)
+            )
         time.sleep(1)
 
         # Send signal to shutdown the server
-        os.kill(int(os.environ['SERVER_PID']), signal.SIGINT)
+        os.kill(int(os.environ["SERVER_PID"]), signal.SIGINT)
 
         # Send more requests and should be rejected
         try:
             triton_client.infer(model_name, inputs)
-            self.assertTrue(False,
-                            "expected error for new inference during shutdown")
+            self.assertTrue(False, "expected error for new inference during shutdown")
         except InferenceServerException as ex:
+            self.assertIn("in ensemble 'ensemble_zero_1_float32'", ex.message())
             self.assertIn(
-                "in ensemble 'ensemble_zero_1_float32', Server is stopping, scheduler for model has stopped accepting new inference requests",
-                ex.message())
+                "Server is stopping, scheduler for model has stopped accepting new inference requests",
+                ex.message(),
+            )
 
         # Wait until the results are available in user_data
         time_out = 10
-        while ((len(async_results) < request_count) and time_out > 0):
+        while (len(async_results) < request_count) and time_out > 0:
             time_out = time_out - 1
             time.sleep(1)
 
@@ -2581,12 +2832,428 @@ def callback(user_data, result, error):
         for result in async_results:
             if type(result) == InferenceServerException:
                 raise result
-            output_data = result.as_numpy('OUTPUT0')
+            output_data = result.as_numpy("OUTPUT0")
             np.testing.assert_allclose(
-                output_data,
-                input_data,
-                err_msg='Inference result is not correct')
+                output_data, input_data, err_msg="Inference result is not correct"
+            )
+
+    def test_load_gpu_limit(self):
+        model_name = "cuda_memory_consumer"
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+            triton_client.load_model(model_name + "_1")
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+
+        # After the first load, the memory consumption should have exceeded
+        # the specified limit, so the second load will fail
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+            triton_client.load_model(model_name + "_2")
+            self.assertTrue(False, "expected error for loading model")
+        except Exception as ex:
+            self.assertIn("memory limit set for GPU 0 has exceeded", ex.message())
+
+        # Load should work after explicitly unloading a model to free memory
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+            triton_client.unload_model(model_name + "_1")
+            triton_client.load_model(model_name + "_2")
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+
+    def test_concurrent_model_load_speedup(self):
+        # Initialize client
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+        # Each model should have a loading delay of 10 seconds
+        model_pairs = [
+            ["identity_zero_1_int32_1", "identity_zero_1_int32_2"],
+            ["python_identity_fp32_1", "python_identity_fp32_2"],
+        ]
+        # Test each model pair for speed up
+        for model_pair in model_pairs:
+            # Load both models concurrently
+            threads = []
+            for model_name in model_pair:
+                threads.append(
+                    threading.Thread(
+                        target=triton_client.load_model, args=(model_name,)
+                    )
+                )
+            start_time = time.time()
+            for thread in threads:
+                thread.start()
+            for thread in threads:
+                thread.join()
+            end_time = time.time()
+            loading_time = end_time - start_time
+            # Each of the two models has a minimum loading delay of 10 seconds.
+            # Speedup is observed when the concurrent loading time is less than
+            # 20 seconds, but use a tighter bound of 15 seconds.
+            self.assertLess(
+                loading_time, 15.0, "Concurrent loading speedup not observed"
+            )
+            # Concurrent loading time cannot be < 10 seconds
+            self.assertGreaterEqual(
+                loading_time, 10.0, "Invalid concurrent loading time"
+            )
+            # Make sure the models are loaded
+            self.assertTrue(triton_client.is_server_live())
+            self.assertTrue(triton_client.is_server_ready())
+            for model_name in model_pair:
+                self.assertTrue(triton_client.is_model_ready(model_name))
+
+    def test_concurrent_model_load(self):
+        # Initialize client
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+        # Load the same-named model concurrently
+        with concurrent.futures.ThreadPoolExecutor() as pool:
+            # First, load an identity backend model that has a 10-second loading delay
+            thread_1 = pool.submit(triton_client.load_model, "identity_model")
+            time.sleep(2)  # wait between loads
+            # Switch the model files to the Python backend version
+            shutil.move("models", "models_v1")
+            shutil.move("models_v2", "models")
+            # Second load should be blocked until the first completes
+            thread_2 = pool.submit(triton_client.load_model, "identity_model")
+            # Both loads should succeed
+            thread_1.result()
+            thread_2.result()
+        # Check the model is ready
+        self.assertTrue(triton_client.is_server_live())
+        self.assertTrue(triton_client.is_server_ready())
+        self.assertTrue(triton_client.is_model_ready("identity_model"))
+        # Check that the model ultimately loaded is the second one
+        model_metadata = triton_client.get_model_metadata("identity_model")
+        self.assertEqual(model_metadata.platform, "python")
+
+    def test_concurrent_model_load_unload(self):
+        # Initialize client
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+        # Load identity_zero_1_int32 and unload it while loading
+        # The unload operation should wait until the load is completed
+        with concurrent.futures.ThreadPoolExecutor() as pool:
+            load_thread = pool.submit(triton_client.load_model, "identity_zero_1_int32")
+            time.sleep(2)  # wait between load and unload
+            unload_thread = pool.submit(
+                triton_client.unload_model, "identity_zero_1_int32"
+            )
+            load_thread.result()
+            unload_thread.result()
+        self.assertTrue(triton_client.is_server_live())
+        self.assertTrue(triton_client.is_server_ready())
+        self.assertFalse(triton_client.is_model_ready("identity_zero_1_int32"))
+        # Load ensemble_zero_1_float32 and unload its dependency while loading
+        # The unload operation should wait until the load is completed
+        with concurrent.futures.ThreadPoolExecutor() as pool:
+            load_thread = pool.submit(
+                triton_client.load_model, "ensemble_zero_1_float32"
+            )
+            time.sleep(2)  # wait between load and unload
+            unload_thread = pool.submit(
+                triton_client.unload_model, "custom_zero_1_float32"
+            )
+            load_thread.result()
+            unload_thread.result()
+        self.assertTrue(triton_client.is_server_live())
+        self.assertTrue(triton_client.is_server_ready())
+        self.assertFalse(triton_client.is_model_ready("ensemble_zero_1_float32"))
+        self.assertFalse(triton_client.is_model_ready("custom_zero_1_float32"))
+        # Load both models and unload them concurrently
+        model_names = ["identity_zero_1_int32", "ensemble_zero_1_float32"]
+        for is_load in [True, False]:
+            action_fn = (
+                triton_client.load_model if is_load else triton_client.unload_model
+            )
+            with concurrent.futures.ThreadPoolExecutor() as pool:
+                threads = []
+                for model_name in model_names:
+                    threads.append(pool.submit(action_fn, model_name))
+                for thread in concurrent.futures.as_completed(threads):
+                    thread.result()
+            for model_name in model_names:
+                self.assertEqual(is_load, triton_client.is_model_ready(model_name))
+
+    def test_concurrent_same_model_load_unload_stress(self):
+        model_name = "identity_zero_1_int32"
+        num_threads = 32
+        num_iterations = 1024
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+
+        load_fail_reasons = [
+            "unexpected miss in global map",
+            "no version is available",
+            "failed to poll from model repository",
+        ]
+        unload_fail_reasons = ["versions that are still available: 1"]
+        load_fail_messages = [
+            ("failed to load '" + model_name + "', " + reason)
+            for reason in load_fail_reasons
+        ]
+        unload_fail_messages = [
+            ("failed to unload '" + model_name + "', " + reason)
+            for reason in unload_fail_reasons
+        ]
+        global_exception_stats = {}  # { "exception message": number of occurrences }
+        load_before_unload_finish = [False]  # use list to access by reference
+
+        def _load_unload():
+            exception_stats = {}  # { "exception message": number of occurrences }
+            for i in range(num_iterations):
+                try:
+                    triton_client.load_model(model_name)
+                except InferenceServerException as ex:
+                    # It is acceptable for an unload to happen after a load
+                    # completes, but only before the load can verify its load state.
+                    error_message = ex.message()
+                    self.assertIn(error_message, load_fail_messages)
+                    if error_message not in exception_stats:
+                        exception_stats[error_message] = 0
+                    exception_stats[error_message] += 1
+                try:
+                    triton_client.unload_model(model_name)
+                except InferenceServerException as ex:
+                    # It is acceptable for a load to happen after an unload
+                    # completes, but only before the unload can verify its unload state.
+                    error_message = ex.message()
+                    self.assertIn(error_message, unload_fail_messages)
+                    if error_message not in exception_stats:
+                        exception_stats[error_message] = 0
+                    exception_stats[error_message] += 1
+                    load_before_unload_finish[0] = True
+            return exception_stats
+
+        with concurrent.futures.ThreadPoolExecutor() as pool:
+            threads = []
+            for i in range(num_threads):
+                threads.append(pool.submit(_load_unload))
+            for t in threads:
+                exception_stats = t.result()
+                for key, count in exception_stats.items():
+                    if key not in global_exception_stats:
+                        global_exception_stats[key] = 0
+                    global_exception_stats[key] += count
+
+        self.assertTrue(triton_client.is_server_live())
+        self.assertTrue(triton_client.is_server_ready())
+        self.assertTrue(
+            load_before_unload_finish[0],
+            "The test case did not replicate a load while async unloading. Consider increase concurrency.",
+        )
+
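+        # Note: test.sh prints this statistics file when the test passes so the
+        # exception counts gathered across threads can be inspected afterwards.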
+        stats_path = "./test_concurrent_same_model_load_unload_stress.statistics.log"
+        with open(stats_path, mode="w", encoding="utf-8") as f:
+            f.write(str(global_exception_stats) + "\n")
+
+    def test_concurrent_model_instance_load_speedup(self):
+        # Initialize client
+        try:
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+        models = ["identity_fp32"]
+        # Create 2 instances, each of which has a loading delay of 10 seconds.
+        num_instances = 2
+        instance_group = [{"kind": "KIND_CPU", "count": num_instances}]
+        config = {"instance_group": instance_group}
+        for model in models:
+            # Instances should be loaded concurrently for supported backends
+            start_time = time.time()
+            try:
+                triton_client.load_model(model, config=json.dumps(config))
+            except Exception as ex:
+                self.assertTrue(False, "unexpected error {}".format(ex))
+            end_time = time.time()
+            loading_time = end_time - start_time
+            print(f"Time to load {num_instances} instances: {loading_time}")
+
+            # Each of the two instances has a minimum loading delay of 10 seconds.
+            # Speedup is observed when the concurrent loading time is less than
+            # 20 seconds, but use a tighter bound of 15 seconds.
+            self.assertLess(
+                loading_time, 15.0, "Concurrent loading speedup not observed"
+            )
+            # Concurrent loading time cannot be < 10 seconds
+            self.assertGreaterEqual(
+                loading_time, 10.0, "Invalid concurrent loading time"
+            )
+            # Make sure the models are loaded
+            self.assertTrue(triton_client.is_server_live())
+            self.assertTrue(triton_client.is_server_ready())
+            self.assertTrue(triton_client.is_model_ready(model))
+
+    def _call_with_timeout(self, callable, timeout_secs):
+        # Setup handler for timing out call
+        def timeout_handler(sig, frame):
+            raise TimeoutError()
+
+        signal.signal(signal.SIGALRM, timeout_handler)
+        signal.alarm(timeout_secs)
+        result = callable()
+        # Cancel any pending alarm so it cannot fire after the call returns
+        signal.alarm(0)
+        return result
+
+    def _call_with_expected_timeout(self, callable, timeout_secs=3):
+        # Call callable with expectation that it will timeout
+        try:
+            self._call_with_timeout(callable, timeout_secs)
+        except TimeoutError:
+            print("Inference timed out as expected.")
+            return
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+        else:
+            self.assertTrue(False, "unexpected success, call should've timed out.")
+
+    def _get_fp32_io(self, client_type):
+        # Config
+        input_names = ["INPUT0", "INPUT1"]
+        output_names = ["OUTPUT0", "OUTPUT1"]
+        dtype, dims, shape = ("TYPE_FP32", [-1, 16], [1, 16])
+        input_config = [
+            {"name": name, "data_type": dtype, "dims": dims} for name in input_names
+        ]
+        output_config = [
+            {"name": name, "data_type": dtype, "dims": dims} for name in output_names
+        ]
+        # Inputs
+        inputs = []
+        for name in input_names:
+            inputs.append(
+                client_type.InferInput(name, shape, dtype.replace("TYPE_", ""))
+            )
+            inputs[-1].set_data_from_numpy(np.ones(shape, dtype=np.float32))
+        return input_config, output_config, inputs
+
+    def test_concurrent_model_instance_load_sanity(self):
+        cpu, gpu = "KIND_CPU", "KIND_GPU"
+        default_kinds = [cpu, gpu]
+        backend_kinds = {"plan": [gpu], "openvino": [cpu]}
+        try:
+            client_type = httpclient
+            triton_client = client_type.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+
+        backends = os.environ.get("PARALLEL_BACKENDS", "").split()
+        self.assertTrue(len(backends) > 0, "PARALLEL_BACKENDS wasn't set")
+
+        num_instances = 5
+        input_config, output_config, inputs = self._get_fp32_io(client_type)
+        for backend in backends:
+            model = tu.get_model_name(backend, np.float32, np.float32, np.float32)
+            kinds = backend_kinds.get(backend, default_kinds)
+            for kind in kinds:
+                with self.subTest(backend=backend, model=model, kind=kind):
+                    # Setup model config
+                    instance_group = {"kind": kind, "count": num_instances}
+                    # Disable batching and configure sequence batching so that each
+                    # instance cannot accept new requests while it is busy with an
+                    # ongoing sequence. This guarantees exactly one request is sent
+                    # to each instance.
+                    max_batch_size = 0
+                    sequence_timeout_secs = 10
+                    sequence_batching = {
+                        "direct": {},
+                        "max_sequence_idle_microseconds": sequence_timeout_secs
+                        * 1000000,
+                    }
+                    config = {
+                        "instance_group": instance_group,
+                        "max_batch_size": max_batch_size,
+                        "sequence_batching": sequence_batching,
+                        "input": input_config,
+                        "output": output_config,
+                    }
+                    print(
+                        f"~~~ Backend: [{backend}], Model: [{model}], Config: [{config}] ~~~"
+                    )
+                    # Load the model
+                    try:
+                        triton_client.load_model(model, config=json.dumps(config))
+                    except Exception as ex:
+                        self.assertTrue(False, "unexpected error {}".format(ex))
+
+                    # Make sure the model is loaded
+                    self.assertTrue(triton_client.is_server_live())
+                    self.assertTrue(triton_client.is_model_ready(model))
+                    print(
+                        "Model Repository Index after load:",
+                        triton_client.get_model_repository_index(),
+                    )
+
+                    # Test inference on each instance
+                    for i in range(1, num_instances + 1):
+                        try:
+                            triton_client.infer(
+                                model, inputs, sequence_id=i, sequence_start=True
+                            )
+                        except Exception as ex:
+                            self.assertTrue(
+                                False, "unexpected inference error {}".format(ex)
+                            )
+
+                    # Each instance should be busy until its sequence times out, so an
+                    # additional infer call should time out. If it doesn't, something is
+                    # wrong and the test should fail.
+                    callable = partial(
+                        triton_client.infer,
+                        model,
+                        inputs,
+                        sequence_id=num_instances + 1,
+                        sequence_start=True,
+                    )
+                    self._call_with_expected_timeout(callable, timeout_secs=3)
+
+                    # Unload the model
+                    try:
+                        triton_client.unload_model(model)
+                    except Exception as ex:
+                        self.assertTrue(False, "unexpected error {}".format(ex))
+
+                    # Allow server to fully unload model before next test iteration
+                    num_tries = 10
+                    for i in range(num_tries):
+                        if triton_client.is_server_ready():
+                            break
+                        print(
+                            f"[Attempt {i}] Server not ready yet, sleeping and retrying. Current repository index: {triton_client.get_model_repository_index()}"
+                        )
+                        time.sleep(6)
+                    print(
+                        "Model Repository Index after unload attempts:",
+                        triton_client.get_model_repository_index(),
+                    )
+                    self.assertTrue(triton_client.is_server_ready())
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_lifecycle/test.sh b/qa/L0_lifecycle/test.sh
index 5a34798aa1..8c389d46ac 100755
--- a/qa/L0_lifecycle/test.sh
+++ b/qa/L0_lifecycle/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -48,6 +48,21 @@ SERVER=/opt/tritonserver/bin/tritonserver
 TEST_RESULT_FILE='test_results.txt'
 source ../common/util.sh
 
+function check_unit_test() {
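+    # Note: this helper inspects "$?", so it must be called on the line
+    # immediately following the python unit test invocation, e.g.:
+    #     python $LC_TEST LifeCycleTest.<test_name> >>$CLIENT_LOG 2>&1
+    #     check_unit_test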
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Failed\n***"
+        RET=1
+    else
+        check_test_results $TEST_RESULT_FILE 1
+        if [ $? -ne 0 ]; then
+            cat $CLIENT_LOG
+            echo -e "\n***\n*** Test Result Verification Failed\n***"
+            RET=1
+        fi
+    fi
+}
+
 RET=0
 rm -fr *.log
 
@@ -74,18 +89,7 @@ sleep $SLEEP_TIME
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_noexit >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -109,18 +113,7 @@ sleep $SLEEP_TIME
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_noexit >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -146,18 +139,7 @@ sleep $SLEEP_TIME
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_noexit >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -183,18 +165,7 @@ sleep $SLEEP_TIME
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_noexit >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -206,7 +177,7 @@ LOG_IDX=$((LOG_IDX+1))
 rm -rf models
 mkdir models
 SERVER_ARGS="--model-repository=`pwd`/models"
-SERVER_LOG="./inference_server_$LOG_IDX.log"
+SERVER_LOG="./stub_inference_server_$LOG_IDX.log"
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
@@ -215,6 +186,7 @@ if [ "$SERVER_PID" == "0" ]; then
 fi
 SAVED_SERVER_PID=$SERVER_PID
 SERVER_ARGS="--model-repository=`pwd`/models --http-port 8003 --metrics-port 8004"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
 run_server
 sleep $SLEEP_TIME
 # check server log for the warning messages
@@ -236,7 +208,7 @@ LOG_IDX=$((LOG_IDX+1))
 rm -rf models
 mkdir models
 SERVER_ARGS="--model-repository=`pwd`/models"
-SERVER_LOG="./inference_server_$LOG_IDX.log"
+SERVER_LOG="./stub_inference_server_$LOG_IDX.log"
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
@@ -245,6 +217,7 @@ if [ "$SERVER_PID" == "0" ]; then
 fi
 SAVED_SERVER_PID=$SERVER_PID
 SERVER_ARGS="--model-repository=`pwd`/models --grpc-port 8003 --metrics-port 8004"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
 run_server
 sleep $SLEEP_TIME
 # check server log for the warning messages
@@ -267,7 +240,7 @@ LOG_IDX=$((LOG_IDX+1))
 rm -rf models
 mkdir models
 SERVER_ARGS="--model-repository=`pwd`/models"
-SERVER_LOG="./inference_server_$LOG_IDX.log"
+SERVER_LOG="./stub_inference_server_$LOG_IDX.log"
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
@@ -276,6 +249,7 @@ if [ "$SERVER_PID" == "0" ]; then
 fi
 SAVED_SERVER_PID=$SERVER_PID
 SERVER_ARGS="--model-repository=`pwd`/models --grpc-port 8003 --http-port 8004"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
 run_server
 sleep $SLEEP_TIME
 # check server log for the warning messages
@@ -365,7 +339,9 @@ done
 for i in onnx plan ; do
     cp -r $DATADIR/qa_model_repository/${i}_float32_float32_float32 models_0/.
 done
-rm models/graphdef_float32_float32_float32/*/*
+# Change the model files so that multiple versions will be loaded, and one of
+# the versions will fail to load and cause all other versions to be unloaded.
+rm models/graphdef_float32_float32_float32/3/*
 
 SERVER_ARGS="--model-repository=`pwd`/models --model-repository=`pwd`/models_0 \
              --exit-on-error=false --exit-timeout-secs=5"
@@ -383,18 +359,7 @@ wait_for_model_stable $SERVER_TIMEOUT
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_modelfail >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -419,18 +384,7 @@ wait_for_model_stable $SERVER_TIMEOUT
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_modelfail_nostrict >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -449,8 +403,10 @@ for i in onnx plan ; do
 done
 rm models/graphdef_float32_float32_float32/config.pbtxt
 
+# Autocomplete should not be turned on for this test because the test asserts
+# that an error is logged when running in strict model configuration mode.
 SERVER_ARGS="--model-repository=`pwd`/models --model-repository=`pwd`/models_0 \
-             --exit-on-error=false --exit-timeout-secs=5"
+             --exit-on-error=false --exit-timeout-secs=5 --strict-model-config=true"
 SERVER_LOG="./inference_server_$LOG_IDX.log"
 run_server_tolive
 if [ "$SERVER_PID" == "0" ]; then
@@ -465,18 +421,7 @@ wait_for_model_stable $SERVER_TIMEOUT
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_no_model_config >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -521,18 +466,7 @@ wait_for_model_stable $SERVER_TIMEOUT
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_init_error_modelfail >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -566,18 +500,7 @@ wait_for_model_stable $SERVER_TIMEOUT
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_model_no_version >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -606,18 +529,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_ignore_zero_prefixed_version >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -652,18 +564,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_ignore_non_intergral_version >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -698,18 +599,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_dynamic_model_load_unload >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -738,18 +628,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_dynamic_model_load_unload_disabled >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -777,18 +656,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_dynamic_version_load_unload >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -817,18 +685,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_dynamic_version_load_unload_disabled >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -863,18 +720,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_dynamic_model_modify >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -902,18 +748,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_dynamic_file_delete >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -947,18 +782,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_multiple_model_repository_polling >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -994,18 +818,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_multiple_model_repository_control >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1037,18 +850,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_model_control >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1080,18 +882,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_model_control_fail >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1123,18 +914,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_model_control_ensemble >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1177,18 +957,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_multiple_model_repository_control_startup_models >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1231,18 +1000,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_multiple_model_repository_control_startup_models >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1252,8 +1010,8 @@ LOG_IDX=$((LOG_IDX+1))
 
 # Test loading all models on startup in EXPLICIT model control mode AND
 # an additional --load-model argument, it should fail
-rm -fr models 
-mkdir models 
+rm -fr models
+mkdir models
 for i in onnx ; do
     cp -r $DATADIR/qa_model_repository/${i}_float32_float32_float32 models/.
     sed -i "s/max_batch_size:.*/max_batch_size: 1/" models/${i}_float32_float32_float32/config.pbtxt
@@ -1280,6 +1038,34 @@ fi
 
 LOG_IDX=$((LOG_IDX+1))
 
+# Test loading a startup model that doesn't exist, it should fail
+rm -fr models && mkdir models
+INVALID_MODEL="does-not-exist"
+SERVER_ARGS="--model-repository=`pwd`/models \
+             --model-control-mode=explicit \
+             --strict-readiness=true \
+             --exit-on-error=true \
+             --load-model=${INVALID_MODEL}"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** Failed: $SERVER started successfully when it was expected to fail\n***"
+    echo -e "ERROR: Startup model [${INVALID_MODEL}] should have failed to load."
+    cat $SERVER_LOG
+    RET=1
+
+    kill $SERVER_PID
+    wait $SERVER_PID
+fi
+# check server log for the error messages to make sure they're printed
+if [ `grep -c "model not found in any model repository" $SERVER_LOG` == "0" ]; then
+    echo -e "\n***\n*** Server log ${SERVER_LOG} did not print model load failure for non-existent model\n***"
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+LOG_IDX=$((LOG_IDX+1))
+
 # LifeCycleTest.test_model_repository_index
 rm -fr models models_0 config.pbtxt.*
 mkdir models models_0
@@ -1313,18 +1099,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_model_repository_index >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1508,18 +1283,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_model_reload_fail >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1539,48 +1303,67 @@ for protocol in grpc http; do
     if [[ $protocol == "grpc" ]]; then
        export TRITONSERVER_USE_GRPC=1
     fi
-    rm -fr models simple_float32_float32_float32
-    mkdir models
-    # Prepare two models of different platforms, but with the same name
-    cp -r $DATADIR/qa_model_repository/plan_float32_float32_float32 models/simple_float32_float32_float32
-    sed -i "s/plan_float32_float32_float32/simple_float32_float32_float32/" models/simple_float32_float32_float32/config.pbtxt
-    cp -r $DATADIR/qa_model_repository/libtorch_float32_float32_float32 simple_float32_float32_float32
-    sed -i "s/libtorch_float32_float32_float32/simple_float32_float32_float32/" simple_float32_float32_float32/config.pbtxt
 
-    SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit \
-                 --load-model=simple_float32_float32_float32 \
-                 --exit-timeout-secs=5"
-    SERVER_LOG="./inference_server_$LOG_IDX.log"
-    run_server
-    if [ "$SERVER_PID" == "0" ]; then
-        echo -e "\n***\n*** Failed to start $SERVER\n***"
-        cat $SERVER_LOG
-        exit 1
-    fi
+    # The OS file system is more granular when determining modification time:
+    # the modification timestamp is updated when the file content is changed in
+    # place, but not when the file is copied or moved. With Triton, any
+    # operation that changes a file is a modification. Thus, preparing the
+    # models in reverse order tests the case where a replacement model has an
+    # earlier or equal modification timestamp than the current model; Triton
+    # must still detect that the model is modified and proceed with the model
+    # reload.
+    for prep_order in normal reverse; do
+        rm -fr models simple_float32_float32_float32
+        mkdir models
+        # Prepare two models of different platforms, but with the same name
+        if [[ $prep_order == "normal" ]]; then
+            # Prepare the TRT model first, then the pytorch model
+            cp -r $DATADIR/qa_model_repository/plan_float32_float32_float32 models/simple_float32_float32_float32
+            sed -i "s/plan_float32_float32_float32/simple_float32_float32_float32/" models/simple_float32_float32_float32/config.pbtxt
+            cp -r $DATADIR/qa_model_repository/libtorch_float32_float32_float32 simple_float32_float32_float32
+            sed -i "s/libtorch_float32_float32_float32/simple_float32_float32_float32/" simple_float32_float32_float32/config.pbtxt
+        else
+            # Prepare the pytorch model first, then the TRT model
+            cp -r $DATADIR/qa_model_repository/libtorch_float32_float32_float32 simple_float32_float32_float32
+            sed -i "s/libtorch_float32_float32_float32/simple_float32_float32_float32/" simple_float32_float32_float32/config.pbtxt
+            cp -r $DATADIR/qa_model_repository/plan_float32_float32_float32 models/simple_float32_float32_float32
+            sed -i "s/plan_float32_float32_float32/simple_float32_float32_float32/" models/simple_float32_float32_float32/config.pbtxt
+        fi
 
-    rm -f $CLIENT_LOG
-    set +e
-    python $LC_TEST LifeCycleTest.test_load_same_model_different_platform >>$CLIENT_LOG 2>&1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Failed\n***"
-        RET=1
-    else
-        check_test_results $TEST_RESULT_FILE 1
+        SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit \
+                    --load-model=simple_float32_float32_float32 \
+                    --exit-timeout-secs=5"
+        SERVER_LOG="./inference_server_$LOG_IDX.log"
+        run_server
+        if [ "$SERVER_PID" == "0" ]; then
+            echo -e "\n***\n*** Failed to start $SERVER\n***"
+            cat $SERVER_LOG
+            exit 1
+        fi
+
+        rm -f $CLIENT_LOG
+        set +e
+        python $LC_TEST LifeCycleTest.test_load_same_model_different_platform >>$CLIENT_LOG 2>&1
         if [ $? -ne 0 ]; then
             cat $CLIENT_LOG
-            echo -e "\n***\n*** Test Result Verification Failed\n***"
+            echo -e "\n***\n*** Test Failed\n***"
             RET=1
+        else
+            check_test_results $TEST_RESULT_FILE 1
+            if [ $? -ne 0 ]; then
+                cat $CLIENT_LOG
+                echo -e "\n***\n*** Test Result Verification Failed\n***"
+                RET=1
+            fi
         fi
-    fi
-    set -e
+        set -e
 
-    kill $SERVER_PID
-    wait $SERVER_PID
+        kill $SERVER_PID
+        wait $SERVER_PID
 
-    unset TRITONSERVER_USE_GRPC
+        LOG_IDX=$((LOG_IDX+1))
+    done
 
-    LOG_IDX=$((LOG_IDX+1))
+    unset TRITONSERVER_USE_GRPC
 done
 
 # Send HTTP request to control endpoint
@@ -1668,7 +1451,7 @@ fi
 set +e
 code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/notapi/v2`
 set -e
-if [ "$code" != "400" ]; then
+if [ "$code" != "404" ]; then
     echo -e "\n***\n*** Test Failed\n***"
     RET=1
 fi
@@ -1676,7 +1459,7 @@ fi
 set +e
 code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/notapi`
 set -e
-if [ "$code" != "400" ]; then
+if [ "$code" != "404" ]; then
     echo -e "\n***\n*** Test Failed\n***"
     RET=1
 fi
@@ -1684,7 +1467,7 @@ fi
 set +e
 code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/models/notapi/foo`
 set -e
-if [ "$code" != "400" ]; then
+if [ "$code" != "404" ]; then
     echo -e "\n***\n*** Test Failed\n***"
     RET=1
 fi
@@ -1716,18 +1499,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_config_override >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1760,18 +1532,9 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_file_override >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
+python $LC_TEST LifeCycleTest.test_file_override_security >>$CLIENT_LOG 2>&1
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1787,7 +1550,7 @@ mkdir models
 cp -r ../custom_models/custom_zero_1_float32 models/. && \
     mkdir -p models/custom_zero_1_float32/1 && \
     (cd models/custom_zero_1_float32 && \
-        echo "dynamic_batching {}" >> config.pbtxt 
+        echo "dynamic_batching {}" >> config.pbtxt
         echo "parameters [" >> config.pbtxt && \
         echo "{ key: \"execute_delay_ms\"; value: { string_value: \"5000\" }}" >> config.pbtxt && \
         echo "]" >> config.pbtxt)
@@ -1802,19 +1565,9 @@ if [ "$SERVER_PID" == "0" ]; then
 fi
 
 set +e
+# Server will be shut down by the test script, so make its PID available to the script
 SERVER_PID=$SERVER_PID python $LC_TEST LifeCycleTest.test_shutdown_dynamic >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 # check server log
@@ -1846,19 +1599,9 @@ if [ "$SERVER_PID" == "0" ]; then
 fi
 
 set +e
+# Server will be shut down by the test script, so make its PID available to the script
 SERVER_PID=$SERVER_PID python $LC_TEST LifeCycleTest.test_shutdown_sequence >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 # check server log
@@ -1886,7 +1629,7 @@ cp -r ensemble_zero_1_float32 models/. && \
 cp -r ../custom_models/custom_zero_1_float32 models/. && \
     mkdir -p models/custom_zero_1_float32/1 && \
     (cd models/custom_zero_1_float32 && \
-        echo "dynamic_batching {}" >> config.pbtxt 
+        echo "dynamic_batching {}" >> config.pbtxt
         echo "parameters [" >> config.pbtxt && \
         echo "{ key: \"execute_delay_ms\"; value: { string_value: \"5000\" }}" >> config.pbtxt && \
         echo "]" >> config.pbtxt)
@@ -1901,32 +1644,306 @@ if [ "$SERVER_PID" == "0" ]; then
 fi
 
 set +e
+# Server will be shut down by the test script, so make its PID available to the script
 SERVER_PID=$SERVER_PID python $LC_TEST LifeCycleTest.test_shutdown_ensemble >>$CLIENT_LOG 2>&1
+check_unit_test
+set -e
+
+# check server log
+if [ `grep -c "Model 'ensemble_zero_1_float32' (version 1) has 1 in-flight inferences" $SERVER_LOG` == "0" ]; then
+    echo -e "\n***\n*** Expect logging for model and in-flight inference count\n***"
+    RET=1
+fi
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_load_gpu_limit
+# Install a dependency of the Python model to be used
+pip install cuda-python
+rm -fr models config.pbtxt.*
+mkdir models
+cp -r ../python_models/cuda_memory_consumer models/cuda_memory_consumer_1 && \
+    cp -r ../python_models/cuda_memory_consumer models/cuda_memory_consumer_2
+
+# Negative testing
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --model-load-gpu-limit -1:0.6"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** unexpected start $SERVER\n***"
+    cat $SERVER_LOG
+    RET=1
+    kill $SERVER_PID
+    wait $SERVER_PID
+elif [ `grep -c "expects device ID >= 0, got -1" $SERVER_LOG` == "0" ]; then
+    echo -e "\n***\n*** Expect error on invalid device\n***"
+    RET=1
+fi
+
+LOG_IDX=$((LOG_IDX+1))
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --model-load-gpu-limit 0:-0.4"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** unexpected start $SERVER\n***"
+    cat $SERVER_LOG
+    RET=1
+    kill $SERVER_PID
+    wait $SERVER_PID
+elif [ `grep -c "expects limit fraction to be in range \[0.0, 1.0\], got -0.4" $SERVER_LOG` == "0" ]; then
+    echo -e "\n***\n*** Expect error on invalid fraction\n***"
+    RET=1
+fi
+
+LOG_IDX=$((LOG_IDX+1))
+
+# Run server to stop model loading if > 60% of GPU 0 memory is used
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --model-load-gpu-limit 0:0.6"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $LC_TEST LifeCycleTest.test_load_gpu_limit >>$CLIENT_LOG 2>&1
+check_unit_test
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_concurrent_model_load_speedup
+rm -rf models
+mkdir models
+MODEL_NAME="identity_zero_1_int32"
+cp -r ${MODEL_NAME} models && mkdir -p models/${MODEL_NAME}/1
+cp -r models/${MODEL_NAME} models/${MODEL_NAME}_1 && \
+    sed -i "s/${MODEL_NAME}/${MODEL_NAME}_1/" models/${MODEL_NAME}_1/config.pbtxt
+mv models/${MODEL_NAME} models/${MODEL_NAME}_2 && \
+    sed -i "s/${MODEL_NAME}/${MODEL_NAME}_2/" models/${MODEL_NAME}_2/config.pbtxt
+MODEL_NAME="identity_fp32"
+cp -r ../python_models/${MODEL_NAME} models && (cd models/${MODEL_NAME} && \
+    mkdir 1 && mv model.py 1 && \
+    echo "    def initialize(self, args):" >> 1/model.py && \
+    echo "        import time" >> 1/model.py && \
+    echo "        time.sleep(10)" >> 1/model.py)
+cp -r models/${MODEL_NAME} models/python_${MODEL_NAME}_1 && \
+    sed -i "s/${MODEL_NAME}/python_${MODEL_NAME}_1/" models/python_${MODEL_NAME}_1/config.pbtxt
+mv models/${MODEL_NAME} models/python_${MODEL_NAME}_2 && \
+    sed -i "s/${MODEL_NAME}/python_${MODEL_NAME}_2/" models/python_${MODEL_NAME}_2/config.pbtxt
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $LC_TEST LifeCycleTest.test_concurrent_model_load_speedup >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_concurrent_model_load
+rm -rf models models_v1 models_v2
+mkdir models models_v2
+cp -r identity_zero_1_int32 models/identity_model && \
+    (cd models/identity_model && \
+        mkdir 1 && \
+        sed -i "s/identity_zero_1_int32/identity_model/" config.pbtxt)
+cp -r ../python_models/identity_fp32 models_v2/identity_model && \
+    (cd models_v2/identity_model && \
+        mkdir 1 && mv model.py 1 && \
+        sed -i "s/identity_fp32/identity_model/" config.pbtxt)
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $LC_TEST LifeCycleTest.test_concurrent_model_load >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_concurrent_model_load_unload
+rm -rf models
+mkdir models
+cp -r identity_zero_1_int32 models && mkdir -p models/identity_zero_1_int32/1
+cp -r ensemble_zero_1_float32 models && mkdir -p models/ensemble_zero_1_float32/1
+cp -r ../custom_models/custom_zero_1_float32 models/. && \
+    mkdir -p models/custom_zero_1_float32/1 && \
+    (cd models/custom_zero_1_float32 && \
+        echo "parameters [" >> config.pbtxt && \
+        echo "{ key: \"creation_delay_sec\"; value: { string_value: \"10\" }}" >> config.pbtxt && \
+        echo "]" >> config.pbtxt)
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $LC_TEST LifeCycleTest.test_concurrent_model_load_unload >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_concurrent_same_model_load_unload_stress
+rm -rf models
+mkdir models
+cp -r identity_zero_1_int32 models && \
+    (cd models/identity_zero_1_int32 && \
+        mkdir 1 && \
+        sed -i "s/string_value: \"10\"/string_value: \"0\"/" config.pbtxt)
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --model-load-thread-count=32 --log-verbose=2"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $LC_TEST LifeCycleTest.test_concurrent_same_model_load_unload_stress >>$CLIENT_LOG 2>&1
 if [ $? -ne 0 ]; then
     cat $CLIENT_LOG
     echo -e "\n***\n*** Test Failed\n***"
     RET=1
 else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
+    cat ./test_concurrent_same_model_load_unload_stress.statistics.log
 fi
 set -e
 
-# check server log
-if [ `grep -c "Model 'ensemble_zero_1_float32' (version 1) has 1 in-flight inferences" $SERVER_LOG` == "0" ]; then
-    echo -e "\n***\n*** Expect logging for model and in-flight inference count\n***"
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_concurrent_model_instance_load_speedup
+rm -rf models
+mkdir models
+MODEL_NAME="identity_fp32"
+cp -r ../python_models/${MODEL_NAME} models/ && (cd models/${MODEL_NAME} && \
+    mkdir 1 && mv model.py 1 && \
+    echo "    def initialize(self, args):" >> 1/model.py && \
+    echo "        import time" >> 1/model.py && \
+    echo "        time.sleep(10)" >> 1/model.py)
+rm models/${MODEL_NAME}/config.pbtxt
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $LC_TEST LifeCycleTest.test_concurrent_model_instance_load_speedup >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
     RET=1
 fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_concurrent_model_instance_load_sanity
+rm -rf models
+mkdir models
+# Sanity check loading multiple instances in parallel for each supported backend
+PARALLEL_BACKENDS="python onnx"
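+# For python, reuse the identity_fp32 model from ../python_models; for onnx, copy
+# version 1 of the prebuilt QA model from $DATADIR.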
+for backend in ${PARALLEL_BACKENDS} ; do
+    model="${backend}_float32_float32_float32"
+    model_dir="models/${model}"
+    if [[ $backend == "python" ]]; then
+      cp -r ../python_models/identity_fp32 ${model_dir}
+      mkdir ${model_dir}/1 && mv ${model_dir}/model.py ${model_dir}/1
+      rm ${model_dir}/config.pbtxt
+    else
+      mkdir models/${model}
+      cp -r $DATADIR/qa_model_repository/${model}/1 models/${model}/1
+    fi
+done
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --log-verbose=2"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+PARALLEL_BACKENDS=${PARALLEL_BACKENDS} python $LC_TEST LifeCycleTest.test_concurrent_model_instance_load_sanity >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+set -e
 
 kill $SERVER_PID
 wait $SERVER_PID
 
 if [ $RET -eq 0 ]; then
   echo -e "\n***\n*** Test Passed\n***"
+else
+  echo -e "\n***\n*** Test Failed\n***"
 fi
 
 exit $RET
diff --git a/qa/L0_logging/logging_endpoint_test.py b/qa/L0_logging/logging_endpoint_test.py
new file mode 100755
index 0000000000..26f98de3da
--- /dev/null
+++ b/qa/L0_logging/logging_endpoint_test.py
@@ -0,0 +1,405 @@
+#!/usr/bin/python
+
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import json
+import sys
+import unittest
+
+import test_util as tu
+import tritonclient.grpc as grpcclient
+import tritonclient.http as httpclient
+from google.protobuf import json_format
+
+
+# Similar setup to the dynamic batcher tests
+class LogEndpointTest(tu.TestResultCollector):
+    def tearDown(self):
+        # Clear all log settings back to the initial state.
+        # Note that tearDown() uses the HTTP client, so the pass/fail of the
+        # HTTP log setting test cases should be checked to make sure tearDown()
+        # executes properly and does not affect the starting state of other
+        # test cases.
+        clear_settings = {
+            "log_file": "",
+            "log_info": True,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        triton_client = httpclient.InferenceServerClient("localhost:8000")
+        triton_client.update_log_settings(settings=clear_settings)
+
+    def check_server_initial_state(self):
+        # Helper function to make sure the log settings are properly
+        # initialized / reset before actually running a test case.
+        # Note that this function uses the HTTP client, so the pass/fail of
+        # the HTTP log setting test cases should be checked to make sure the
+        # initial state is verified properly before running other test cases.
+        initial_settings = {
+            "log_file": "",
+            "log_info": True,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        triton_client = httpclient.InferenceServerClient("localhost:8000")
+        self.assertEqual(initial_settings, triton_client.get_log_settings())
+
+    def test_http_get_settings(self):
+        # Log settings will be the same as default settings since
+        # no update has been made.
+        initial_settings = {
+            "log_file": "",
+            "log_info": True,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        triton_client = httpclient.InferenceServerClient("localhost:8000")
+        self.assertEqual(
+            initial_settings,
+            triton_client.get_log_settings(),
+            "Unexpected initial log settings",
+        )
+
+    def test_grpc_get_settings(self):
+        # Log settings will be the same as default settings since
+        # no update has been made.
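+        # Build the expected LogSettingsResponse protobuf from its JSON form so it
+        # can be compared directly with the object returned by get_log_settings().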
+        initial_settings = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": ""},
+                        "log_info": {"boolParam": True},
+                        "log_warning": {"boolParam": True},
+                        "log_error": {"boolParam": True},
+                        "log_verbose_level": {"uint32Param": 0},
+                        "log_format": {"stringParam": "default"},
+                    }
+                }
+            ),
+            initial_settings,
+        )
+        triton_client = grpcclient.InferenceServerClient("localhost:8001")
+        self.assertEqual(
+            initial_settings,
+            triton_client.get_log_settings(),
+            "Unexpected initial log settings",
+        )
+
+    def test_http_update_settings(self):
+        # Update each possible log configuration
+        # field and check that they are reflected
+        # by the server
+        self.check_server_initial_state()
+
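+        # Each settings dict below changes exactly one field relative to the
+        # previous one, so every log setting is exercised individually.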
+        expected_log_settings_1 = {
+            "log_file": "log_file.log",
+            "log_info": True,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_2 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_3 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_4 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": False,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_5 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": False,
+            "log_verbose_level": 1,
+            "log_format": "default",
+        }
+        expected_log_settings_6 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": False,
+            "log_verbose_level": 1,
+            "log_format": "ISO8601",
+        }
+
+        triton_client = httpclient.InferenceServerClient("localhost:8000")
+        self.assertEqual(
+            expected_log_settings_1,
+            triton_client.update_log_settings(settings=expected_log_settings_1),
+            "Unexpected updated log settings",
+        )
+        self.assertEqual(
+            expected_log_settings_2,
+            triton_client.update_log_settings(settings=expected_log_settings_2),
+            "Unexpected updated log settings",
+        )
+        self.assertEqual(
+            expected_log_settings_3,
+            triton_client.update_log_settings(settings=expected_log_settings_3),
+            "Unexpected updated log settings",
+        )
+        self.assertEqual(
+            expected_log_settings_4,
+            triton_client.update_log_settings(settings=expected_log_settings_4),
+            "Unexpected updated log settings",
+        )
+        self.assertEqual(
+            expected_log_settings_5,
+            triton_client.update_log_settings(settings=expected_log_settings_5),
+            "Unexpected updated log settings",
+        )
+        self.assertEqual(
+            expected_log_settings_6,
+            triton_client.update_log_settings(settings=expected_log_settings_6),
+            "Unexpected updated log settings",
+        )
+
+    def test_grpc_update_settings(self):
+        # Update each possible log configuration
+        # field and check that they are reflected
+        # by the server
+        self.check_server_initial_state()
+        triton_client = grpcclient.InferenceServerClient("localhost:8001")
+
+        log_settings_1 = {
+            "log_file": "log_file.log",
+            "log_info": True,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_1 = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": "log_file.log"},
+                        "log_info": {"boolParam": True},
+                        "log_warning": {"boolParam": True},
+                        "log_error": {"boolParam": True},
+                        "log_verbose_level": {"uint32Param": 0},
+                        "log_format": {"stringParam": "default"},
+                    }
+                }
+            ),
+            expected_log_settings_1,
+        )
+
+        self.assertEqual(
+            expected_log_settings_1,
+            triton_client.update_log_settings(settings=log_settings_1),
+            "Unexpected updated log settings",
+        )
+
+        log_settings_2 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_2 = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": "log_file.log"},
+                        "log_info": {"boolParam": False},
+                        "log_warning": {"boolParam": True},
+                        "log_error": {"boolParam": True},
+                        "log_verbose_level": {"uint32Param": 0},
+                        "log_format": {"stringParam": "default"},
+                    }
+                }
+            ),
+            expected_log_settings_2,
+        )
+
+        self.assertEqual(
+            expected_log_settings_2,
+            triton_client.update_log_settings(settings=log_settings_2),
+            "Unexpected updated log settings",
+        )
+
+        log_settings_3 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_3 = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": "log_file.log"},
+                        "log_info": {"boolParam": False},
+                        "log_warning": {"boolParam": False},
+                        "log_error": {"boolParam": True},
+                        "log_verbose_level": {"uint32Param": 0},
+                        "log_format": {"stringParam": "default"},
+                    }
+                }
+            ),
+            expected_log_settings_3,
+        )
+
+        self.assertEqual(
+            expected_log_settings_3,
+            triton_client.update_log_settings(settings=log_settings_3),
+            "Unexpected updated log settings",
+        )
+
+        log_settings_4 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": False,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_4 = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": "log_file.log"},
+                        "log_info": {"boolParam": False},
+                        "log_warning": {"boolParam": False},
+                        "log_error": {"boolParam": False},
+                        "log_verbose_level": {"uint32Param": 0},
+                        "log_format": {"stringParam": "default"},
+                    }
+                }
+            ),
+            expected_log_settings_4,
+        )
+
+        self.assertEqual(
+            expected_log_settings_4,
+            triton_client.update_log_settings(settings=log_settings_4),
+            "Unexpected updated log settings",
+        )
+
+        log_settings_5 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": False,
+            "log_verbose_level": 1,
+            "log_format": "default",
+        }
+        expected_log_settings_5 = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": "log_file.log"},
+                        "log_info": {"boolParam": False},
+                        "log_warning": {"boolParam": False},
+                        "log_error": {"boolParam": False},
+                        "log_verbose_level": {"uint32Param": 1},
+                        "log_format": {"stringParam": "default"},
+                    }
+                }
+            ),
+            expected_log_settings_5,
+        )
+
+        self.assertEqual(
+            expected_log_settings_5,
+            triton_client.update_log_settings(settings=log_settings_5),
+            "Unexpected updated log settings",
+        )
+
+        log_settings_6 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": False,
+            "log_verbose_level": 1,
+            "log_format": "ISO8601",
+        }
+        expected_log_settings_6 = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": "log_file.log"},
+                        "log_info": {"boolParam": False},
+                        "log_warning": {"boolParam": False},
+                        "log_error": {"boolParam": False},
+                        "log_verbose_level": {"uint32Param": 1},
+                        "log_format": {"stringParam": "ISO8601"},
+                    }
+                }
+            ),
+            expected_log_settings_6,
+        )
+
+        self.assertEqual(
+            expected_log_settings_6,
+            triton_client.update_log_settings(settings=log_settings_6),
+            "Unexpected updated log settings",
+        )
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_logging/test.sh b/qa/L0_logging/test.sh
new file mode 100755
index 0000000000..d83e0b76a4
--- /dev/null
+++ b/qa/L0_logging/test.sh
@@ -0,0 +1,595 @@
+#!/bin/bash
+# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+SIMPLE_HTTP_CLIENT=../clients/simple_http_infer_client
+SIMPLE_GRPC_CLIENT=../clients/simple_grpc_infer_client
+
+CLIENT_TEST=logging_endpoint_test.py
+CLIENT_LOG="client.log"
+TEST_RESULT_FILE="test_results.txt"
+EXPECTED_NUM_TESTS="4"
+
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
+export CUDA_VISIBLE_DEVICES=0
+
+DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository
+MODELBASE=onnx_int32_int32_int32
+
+MODELSDIR=`pwd`/log_models
+
+SERVER=/opt/tritonserver/bin/tritonserver
+source ../common/util.sh
+
+rm -f *.log
+rm -fr $MODELSDIR && mkdir -p $MODELSDIR
+
+# Set up the "simple" model repository from $MODELBASE
+rm -fr $MODELSDIR && mkdir -p $MODELSDIR && \
+    cp -r $DATADIR/$MODELBASE $MODELSDIR/simple && \
+    rm -r $MODELSDIR/simple/2 && rm -r $MODELSDIR/simple/3 && \
+    (cd $MODELSDIR/simple && \
+            sed -i "s/^name:.*/name: \"simple\"/" config.pbtxt)
+RET=0
+
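+# verify_correct_settings queries the /v2/logging endpoint and compares each returned
+# field against the expected values, passed positionally as:
+# [ file | info | warn | error | verbosity | format ]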
+function verify_correct_settings () {
+  log_file_expected=$1
+  log_info_expected=$2
+  log_warn_expected=$3
+  log_error_expected=$4
+  log_verbose_expected=$5
+  log_format_expected=$6
+  code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+
+  if [ `grep -c "\"log_file\":\"$log_file_expected\"" ./curl.out` != "1" ]; then
+    echo -e "\n***\n*** Test Failed: Incorrect Log File Setting\n***"
+    RET=1
+  fi
+  if [ `grep -c "\"log_info\":$log_info_expected" ./curl.out` != "1" ]; then
+    echo -e "\n***\n*** Test Failed: Incorrect Log Info Setting\n***"
+    RET=1
+  fi
+  if [ `grep -c "\"log_warning\":$log_warn_expected" ./curl.out` != "1" ]; then
+    echo -e "\n***\n*** Test Failed: Incorrect Log Warn Setting\n***"
+    RET=1
+  fi
+  if [ `grep -c "\"log_error\":$log_error_expected" ./curl.out` != "1" ]; then
+    echo -e "\n***\n*** Test Failed: Incorrect Log Error Setting\n***"
+    RET=1
+  fi
+  if [ `grep -c "\"log_verbose_level\":$log_verbose_expected" ./curl.out` != "1" ]; then
+    echo -e "\n***\n*** Test Failed: Incorrect Log Verbose Setting\n***"
+    RET=1
+  fi
+  if [ `grep -c "\"log_format\":\"$log_format_expected\"" ./curl.out` != "1" ]; then
+    echo -e "\n***\n*** Test Failed: Incorrect Log Format Setting\n***"
+    RET=1
+  fi
+}
+
+# Run Default Server
+SERVER_ARGS="--model-repository=$MODELSDIR"
+SERVER_LOG="./server.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+# Check Default Settings
+rm -f ./curl.out
+set +e
+
+# Check if the current settings are returned [ file | info | warn | error | verbosity | format ]
+verify_correct_settings "" "true" "true" "true" "0" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_default.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_default.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+# Check that the log streams to the console by default
+console_count=($(wc -l ./server.log))
+if [ $console_count -le 30 ]; then
+    echo -e "\n***\n*** Test Failed: Log File Error\n***"
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Log File (Argument)
+SERVER_ARGS="--log-file=log_file.log --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
+set +e
+
+verify_correct_settings "log_file.log" "true" "true" "true" "0" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_file.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_file.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+expected_log_count=19
+actual_log_count=$(grep -c ^[IWEV][0-9][0-9][0-9][0-9].* ./log_file.log)
+if [ $actual_log_count -lt $expected_log_count ]; then
+    echo $actual_log_count
+    echo $expected_log_count
+    echo -e "\n***\n*** Test Failed: Fewer Log Messages Than Expected $LINENO\n***"
+    RET=1
+fi
+expected_server_count=0
+actual_server_count=$(grep -c ^[IWEV][0-9][0-9][0-9][0-9].* inference_server_log_file.log)
+if [ $actual_server_count -gt $expected_server_count ]; then
+    echo $actual_server_count
+    echo $expected_server_count
+    echo -e "\n***\n*** Test Failed: More Log Messages Than Expected $LINENO\n***"
+    RET=1
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Log File (Dynamic)
+rm -f log_file.log
+SERVER_ARGS="--log-file=log_file.log --log-verbose=1 --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
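+# Redirect logging to other_log.log at runtime by POSTing the new setting to the
+# /v2/logging endpoint.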
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_file":"other_log.log"}' localhost:8000/v2/logging`
+set +e
+
+verify_correct_settings "other_log.log" "true" "true" "true" "1" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_file.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_file.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+# Check that redirection worked properly (the server log has a tolerance of 40 due to
+# unavoidable ONNX framework logging)
+expected_log_count=75
+actual_log_count=$(grep -c ^[IWEV][0-9][0-9][0-9][0-9].* ./log_file.log)
+if [ $actual_log_count -lt $expected_log_count ]; then
+    echo $actual_log_count
+    echo $expected_log_count
+    echo -e "\n***\n*** Test Failed: Fewer Log Messages Than Expected $LINENO\n***"
+    RET=1
+fi
+expected_other_log_count=31
+actual_other_log_count=$(grep -c ^[IWEV][0-9][0-9][0-9][0-9].* ./other_log.log)
+if [ $actual_other_log_count -lt $expected_other_log_count ]; then
+    echo $actual_other_log_count
+    echo $expected_other_log_count
+    echo -e "\n***\n*** Test Failed: Fewer Log Messages Than Expected $LINENO\n***"
+    RET=1
+fi
+expected_server_count=0
+actual_server_count=$(grep -c ^[IWEV][0-9][0-9][0-9][0-9].* inference_server_log_file.log)
+if [ $actual_server_count -gt $expected_server_count ]; then
+    echo $actual_server_count
+    echo $expected_server_count
+    echo -e "\n***\n*** Test Failed: More Log Messages Than Expected $LINENO\n***"
+    RET=1
+fi
+
+set -e
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Log Info (Argument)
+rm -f log_file.log
+SERVER_ARGS="--log-file=log_file.log --log-info=false --log-verbose=1 --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
+set +e
+code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+
+verify_correct_settings "log_file.log" "false" "true" "true" "1" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_info.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_info.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+# Test against guaranteed info message
+count=$(grep -c "Started HTTPService at" ./log_file.log)
+if [ $count -gt 0 ]; then
+    echo -e "\n***\n*** Test Failed: Info Message Not Expected $LINENO\n***"
+    RET=1
+fi
+
+set -e
+
+# Test Log Info (Dynamic)
+set +e
+rm -f ./curl.out
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_info":true}' localhost:8000/v2/logging`
+
+verify_correct_settings "log_file.log" "true" "true" "true" "1" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_info.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_info.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+set +e
+# Test against guaranteed info message
+count=$(grep -c "Waiting for in-flight requests to complete" ./log_file.log)
+if [ $count -ne 1 ]; then
+    echo -e "\n***\n*** Test Failed: Info Message Expected $LINENO\n***"
+    RET=1
+fi
+set -e
+
+# Test Log Warning
+SERVER_ARGS="--log-warning=false --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
+set +e
+code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+
+verify_correct_settings "" "true" "false" "true" "0" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_warning.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_warning.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Log Error
+SERVER_ARGS="--log-error=false --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
+set +e
+code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+
+# Check if the current settings are returned [ file | info | warn | error | verbosity | format ]
+verify_correct_settings "" "true" "true" "false" "0" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_error.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_error.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Log Verbose Level (Argument)
+rm -f log_file.log
+SERVER_ARGS="--log-file=log_file.log --log-verbose=1 --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
+set +e
+code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+
+verify_correct_settings "log_file.log" "true" "true" "true" "1" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_verbose.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_verbose.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+count=$(grep -c "/v2/logging" ./log_file.log)
+if [ $count -ne 2 ]; then
+    echo -e "\n***\n*** Test Failed: Verbose Message Expected $LINENO\n***"
+    RET=1
+fi
+
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_verbose_level":0}' localhost:8000/v2/logging`
+verify_correct_settings "log_file.log" "true" "true" "true" "0" "default"
+
+code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+count=$(grep -c "/v2/logging" ./log_file.log)
+if [ $count -gt 3 ]; then
+    echo -e "\n***\n*** Test Failed: Too Many Verbose Messages $LINENO\n***"
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Log Format (Argument)
+rm -f log_file.log
+SERVER_ARGS="--log-file=log_file.log --log-verbose=1 --log-format=ISO8601 --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
+set +e
+code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+verify_correct_settings "log_file.log" "true" "true" "true" "1" "ISO8601"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_format.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_format.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
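+# Default-format log lines begin with the severity letter followed by MMDD (e.g. I0601),
+# so with --log-format=ISO8601 the first token of the first log line must not match
+# that pattern.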
+line=$(head -n 1 log_file.log)
+date=$(date '+%m%d')
+final_date="I${date}"
+format_date=$(echo $line | head -n1 | awk '{print $1;}')
+if [[ $final_date == $format_date ]]; then
+    echo -e "\n***\n*** Test Failed: Unexpected Log Format $LINENO\n***"
+    RET=1
+fi
+
+set -e
+
+# Test Log Format (Dynamic)
+set +e
+rm -f ./curl.out
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_format":"default"}' localhost:8000/v2/logging`
+verify_correct_settings "log_file.log" "true" "true" "true" "1" "default"
+
+line=$(tail -n 1 log_file.log)
+date=$(date '+%m%d')
+final_date="I${date}"
+format_date=$(echo $line | head -n1 | awk '{print $1;}')
+if [[ $final_date != $format_date ]]; then
+    echo -e "\n***\n*** Test Failed: Unexpected Log Format $LINENO\n***"
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Negative Cases
+SERVER_ARGS="--log-warn=false --model-repository=$MODELSDIR"
+SERVER_LOG="./server.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+
+BOOL_PARAMS=${BOOL_PARAMS:="log_info log_warning log_error"}
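+# Each boolean log setting must reject non-boolean JSON values with HTTP 400 and
+# accept a proper boolean with HTTP 200.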
+for BOOL_PARAM in $BOOL_PARAMS; do
+    # Attempt to use integer instead of bool
+    code=`curl -s -w %{http_code} -o ./curl.out -d'{"'"$BOOL_PARAM"'":1}' localhost:8000/v2/logging`
+    if [ "$code" != "400" ]; then
+        echo $code
+        cat ./curl.out
+        echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+        RET=1
+    fi
+    # Attempt to use upper-case bool
+    code=`curl -s -w %{http_code} -o ./curl.out -d'{"'"$BOOL_PARAM"'":False}' localhost:8000/v2/logging`
+    if [ "$code" != "400" ]; then
+        cat ./curl.out
+        echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+        RET=1
+    fi
+    # Attempt to use string bool
+    code=`curl -s -w %{http_code} -o ./curl.out -d'{"'"$BOOL_PARAM"'":"false"}' localhost:8000/v2/logging`
+    if [ "$code" != "400" ]; then
+        echo $code
+        cat ./curl.out
+        echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+        RET=1
+    fi
+    # Positive test case
+    code=`curl -s -w %{http_code} -o ./curl.out -d'{"'"$BOOL_PARAM"'":true}' localhost:8000/v2/logging`
+    if [ "$code" != "200" ]; then
+        cat ./curl.out
+        echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+        RET=1
+    fi
+done
+
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_verbose_level":-1}' localhost:8000/v2/logging`
+if [ "$code" != "400" ]; then
+    echo $code
+    cat ./curl.out
+    echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+    RET=1
+fi
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_verbose_level":"1"}' localhost:8000/v2/logging`
+if [ "$code" != "400" ]; then
+    echo $code
+    cat ./curl.out
+    echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+    RET=1
+fi
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_verbose_level":0}' localhost:8000/v2/logging`
+if [ "$code" != "200" ]; then
+    echo $code
+    cat ./curl.out
+    echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Python client library
+SERVER_ARGS="--model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_unittest.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+
+python $CLIENT_TEST >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    RET=1
+else
+    check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Result Verification Failed\n***"
+        RET=1
+    fi
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+
+
+if [ $RET -eq 0 ]; then
+    echo -e "\n***\n*** Test Passed\n***"
+else
+    echo -e "\n***\n*** Test FAILED\n***"
+fi
+
+
+exit $RET
diff --git a/qa/L0_long_running_stress/crashing_client.py b/qa/L0_long_running_stress/crashing_client.py
old mode 100644
new mode 100755
index 81ce2e996e..d9c727a3d3
--- a/qa/L0_long_running_stress/crashing_client.py
+++ b/qa/L0_long_running_stress/crashing_client.py
@@ -1,4 +1,6 @@
-# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,29 +27,27 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-import numpy as np
-from multiprocessing import Process, shared_memory
+import argparse
 import time
+from multiprocessing import Process, shared_memory
+
+import numpy as np
 import test_util as tu
-import argparse
 import tritonclient.grpc as grpcclient
 from tritonclient.utils import np_to_triton_dtype
 
 
-def crashing_client(model_name,
-                    dtype,
-                    tensor_shape,
-                    shm_name,
-                    triton_client,
-                    input_name="INPUT0"):
+def crashing_client(
+    model_name, dtype, tensor_shape, shm_name, triton_client, input_name="INPUT0"
+):
     in0 = np.random.random(tensor_shape).astype(dtype)
     if "libtorch" in model_name:
         input_name = "INPUT__0"
     inputs = [
-        grpcclient.InferInput(input_name, tensor_shape,
-                              np_to_triton_dtype(dtype)),
+        grpcclient.InferInput(input_name, tensor_shape, np_to_triton_dtype(dtype)),
     ]
     inputs[0].set_data_from_numpy(in0)
 
@@ -61,13 +61,15 @@ def crashing_client(model_name,
         results = triton_client.infer(model_name, inputs)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-t',
-                        '--trial',
-                        type=str,
-                        required=True,
-                        help='Set trial for the crashing client')
+    parser.add_argument(
+        "-t",
+        "--trial",
+        type=str,
+        required=True,
+        help="Set trial for the crashing client",
+    )
     FLAGS = parser.parse_args()
     trial = FLAGS.trial
 
@@ -75,22 +77,23 @@ def crashing_client(model_name,
     model_name = tu.get_zero_model_name(trial, 1, dtype)
     tensor_shape = (1,) if "nobatch" in trial else (1, 1)
 
-    triton_client = grpcclient.InferenceServerClient(url="localhost:8001",
-                                                     verbose=True)
+    triton_client = grpcclient.InferenceServerClient(url="localhost:8001", verbose=True)
 
     shm = shared_memory.SharedMemory(create=True, size=8)
     count = np.ndarray((1,), dtype=np.int32, buffer=shm.buf)
     count[0] = 0
 
-    p = Process(target=crashing_client,
-                name="crashing_client",
-                args=(
-                    model_name,
-                    dtype,
-                    tensor_shape,
-                    shm.name,
-                    triton_client,
-                ))
+    p = Process(
+        target=crashing_client,
+        name="crashing_client",
+        args=(
+            model_name,
+            dtype,
+            tensor_shape,
+            shm.name,
+            triton_client,
+        ),
+    )
 
     p.start()
 
diff --git a/qa/L0_long_running_stress/scenarios.py b/qa/L0_long_running_stress/scenarios.py
old mode 100644
new mode 100755
index caae4fa12e..abb0004e90
--- a/qa/L0_long_running_stress/scenarios.py
+++ b/qa/L0_long_running_stress/scenarios.py
@@ -1,4 +1,6 @@
-# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -26,28 +28,31 @@
 
 import math
 import sys
+
 sys.path.append("../common")
 
-import numpy as np
-import time
-import test_util as tu
-import tritonclient.grpc as grpcclient
-from tritonclient.utils import np_to_triton_dtype
 import math
-from PIL import Image
 import os
 import subprocess
 import threading
+import time
+
+import numpy as np
+import test_util as tu
+import tritonclient.grpc as grpcclient
+from PIL import Image
+from tritonclient.utils import np_to_triton_dtype
+
 if sys.version_info >= (3, 0):
     import queue
 else:
     import Queue as queue
-from functools import partial
 
 import abc
 import csv
 import json
 import re
+from functools import partial
 
 DEFAULT_TIMEOUT_MS = 25000
 SEQUENCE_LENGTH_MEAN = 16
@@ -65,7 +70,6 @@ def completion_callback(user_data, result, error):
 
 
 class Scenario(metaclass=abc.ABCMeta):
-
     def __init__(self, name, trials, verbose=False, out_stream=sys.stdout):
         self.name_ = name
         self.trials_ = trials
@@ -108,13 +112,15 @@ class ModelOption:
         # 'queue_latency_range_us' specifies the range where queue latency
         # reported should be, otherwise, model concurrency will be adjusted
         # within 'concurrency_range' to influence the queue latency.
-        def __init__(self,
-                     model_name,
-                     batch_size,
-                     concurrency_range,
-                     queue_latency_range_us,
-                     input_shapes=[],
-                     input_file=None):
+        def __init__(
+            self,
+            model_name,
+            batch_size,
+            concurrency_range,
+            queue_latency_range_us,
+            input_shapes=[],
+            input_file=None,
+        ):
             self.model_name_ = model_name
             self.concurrency_range_ = list(concurrency_range)
             self.batch_size_ = batch_size
@@ -124,8 +130,11 @@ def __init__(self,
 
         def run(self, name, sequence_id_range, out_stream):
             csv_file = os.path.join(
-                "csv_dir", "{}_{}_{}.csv".format(name, self.model_name_,
-                                                 self.concurrency_range_[2]))
+                "csv_dir",
+                "{}_{}_{}.csv".format(
+                    name, self.model_name_, self.concurrency_range_[2]
+                ),
+            )
 
             arg_list = [PerfAnalyzerScenario.command_]
             # Always use GRPC streaming feature to ensure requests are handled
@@ -135,8 +144,9 @@ def run(self, name, sequence_id_range, out_stream):
             arg_list += ["-b", "{}".format(self.batch_size_)]
             arg_list += [
                 "--concurrency-range",
-                "{}:{}:1".format(self.concurrency_range_[2],
-                                 self.concurrency_range_[2])
+                "{}:{}:1".format(
+                    self.concurrency_range_[2], self.concurrency_range_[2]
+                ),
             ]
             arg_list += ["-f", csv_file]
             for name, shape in self.input_shapes_:
@@ -146,43 +156,44 @@ def run(self, name, sequence_id_range, out_stream):
             if sequence_id_range is not None:
                 arg_list += [
                     "--sequence-id-range",
-                    "{}:{}".format(sequence_id_range[0], sequence_id_range[1])
+                    "{}:{}".format(sequence_id_range[0], sequence_id_range[1]),
                 ]
 
-            completed_process = subprocess.run(arg_list,
-                                               text=True,
-                                               stdout=subprocess.PIPE,
-                                               stderr=subprocess.STDOUT)
+            completed_process = subprocess.run(
+                arg_list, text=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
+            )
             # Write output to file before checking return code
             print(completed_process.stdout, file=out_stream)
             completed_process.check_returncode()
 
             # Read queue time and adjust concurrency
-            with open(csv_file, newline='') as csvfile:
+            with open(csv_file, newline="") as csvfile:
                 reader = csv.DictReader(csvfile)
                 for row in reader:
-                    current_queue_us = int(row['Server Queue'])
+                    current_queue_us = int(row["Server Queue"])
                     if current_queue_us < self.queue_latency_range_us_[0]:
                         self.concurrency_range_[2] = min(
-                            self.concurrency_range_[2] + 1,
-                            self.concurrency_range_[1])
+                            self.concurrency_range_[2] + 1, self.concurrency_range_[1]
+                        )
                     elif current_queue_us > self.queue_latency_range_us_[0]:
                         self.concurrency_range_[2] = max(
-                            self.concurrency_range_[2] - 1,
-                            self.concurrency_range_[0])
+                            self.concurrency_range_[2] - 1, self.concurrency_range_[0]
+                        )
                     break
-            m = re.search(r'Request count: ([0-9]+)', completed_process.stdout)
+            m = re.search(r"Request count: ([0-9]+)", completed_process.stdout)
             return int(m.group(1))
 
-    def __init__(self,
-                 name,
-                 rng,
-                 sequence_trials,
-                 identity_trials,
-                 queue_latency_range_us=(10000, 100000),
-                 sequence_id_range=None,
-                 verbose=False,
-                 out_stream=sys.stdout):
+    def __init__(
+        self,
+        name,
+        rng,
+        sequence_trials,
+        identity_trials,
+        queue_latency_range_us=(10000, 100000),
+        sequence_id_range=None,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
         super().__init__(name, [], verbose, out_stream)
         self.rng_ = rng
         self.sequence_id_range_ = sequence_id_range
@@ -193,8 +204,10 @@ def __init__(self,
 
         # Add no validation models
         self.options_.append(
-            PerfAnalyzerScenario.ModelOption("resnet_v1_50_graphdef_def", 32,
-                                             (1, 4, 1), queue_latency_range_us))
+            PerfAnalyzerScenario.ModelOption(
+                "resnet_v1_50_graphdef_def", 32, (1, 4, 1), queue_latency_range_us
+            )
+        )
         for trial in sequence_trials:
             dtype = self.get_datatype(trial)
             # Skip string sequence model for now, it is hard for PA to generate
@@ -203,8 +216,10 @@ def __init__(self,
                 continue
             model_name = tu.get_sequence_model_name(trial, dtype)
             self.options_.append(
-                PerfAnalyzerScenario.ModelOption(model_name, 1, (1, 4, 1),
-                                                 queue_latency_range_us))
+                PerfAnalyzerScenario.ModelOption(
+                    model_name, 1, (1, 4, 1), queue_latency_range_us
+                )
+            )
         for trial in identity_trials:
             dtype = np.float32
             model_name = tu.get_zero_model_name(trial, 1, dtype)
@@ -213,9 +228,10 @@ def __init__(self,
             else:
                 input_shapes = [("INPUT0", "16")]
             self.options_.append(
-                PerfAnalyzerScenario.ModelOption(model_name, 1, (1, 4, 1),
-                                                 queue_latency_range_us,
-                                                 input_shapes))
+                PerfAnalyzerScenario.ModelOption(
+                    model_name, 1, (1, 4, 1), queue_latency_range_us, input_shapes
+                )
+            )
 
         # Add output validation version of the models
         # Skip resnet as the output data has variation which makes exact
@@ -223,25 +239,31 @@ def __init__(self,
         for trial in sequence_trials:
             dtype = self.get_datatype(trial)
             model_name = tu.get_sequence_model_name(trial, dtype)
-            data_file = os.path.join("validation_data",
-                                     "{}.json".format(model_name))
+            data_file = os.path.join("validation_data", "{}.json".format(model_name))
             self.generate_sequence_data(trial, dtype, data_file)
             self.options_.append(
-                PerfAnalyzerScenario.ModelOption(model_name,
-                                                 1, (1, 4, 1),
-                                                 queue_latency_range_us,
-                                                 input_file=data_file))
+                PerfAnalyzerScenario.ModelOption(
+                    model_name,
+                    1,
+                    (1, 4, 1),
+                    queue_latency_range_us,
+                    input_file=data_file,
+                )
+            )
         for trial in identity_trials:
             dtype = np.float32
             model_name = tu.get_zero_model_name(trial, 1, dtype)
-            data_file = os.path.join("validation_data",
-                                     "{}.json".format(model_name))
+            data_file = os.path.join("validation_data", "{}.json".format(model_name))
             self.generate_identity_data(trial, dtype, data_file)
             self.options_.append(
-                PerfAnalyzerScenario.ModelOption(model_name,
-                                                 1, (1, 4, 1),
-                                                 queue_latency_range_us,
-                                                 input_file=data_file))
+                PerfAnalyzerScenario.ModelOption(
+                    model_name,
+                    1,
+                    (1, 4, 1),
+                    queue_latency_range_us,
+                    input_file=data_file,
+                )
+            )
 
     def generate_sequence_data(self, trial, dtype, data_filename):
         input0 = "INPUT" if "libtorch" not in trial else "INPUT__0"
@@ -254,8 +276,7 @@ def generate_sequence_data(self, trial, dtype, data_filename):
             elif dtype == np.dtype(object):
                 res = str(i)
             else:
-                raise Exception(
-                    "unexpected sequence data type {}".format(dtype))
+                raise Exception("unexpected sequence data type {}".format(dtype))
             input_data.append({input0: [res]})
         output0 = "OUTPUT" if "libtorch" not in trial else "OUTPUT__0"
         output_data = []
@@ -271,8 +292,7 @@ def generate_sequence_data(self, trial, dtype, data_filename):
                 elif dtype == np.dtype(object):
                     res = str(sum)
                 else:
-                    raise Exception(
-                        "unexpected sequence data type {}".format(dtype))
+                    raise Exception("unexpected sequence data type {}".format(dtype))
                 output_data.append({output0: [res]})
         else:
             for i in range(3):
@@ -284,17 +304,17 @@ def generate_sequence_data(self, trial, dtype, data_filename):
                 elif dtype == np.dtype(object):
                     res = str(res)
                 else:
-                    raise Exception(
-                        "unexpected sequence data type {}".format(dtype))
+                    raise Exception("unexpected sequence data type {}".format(dtype))
                 output_data.append(
-                    {output0: [res if dtype != np.dtype(object) else str(res)]})
+                    {output0: [res if dtype != np.dtype(object) else str(res)]}
+                )
         data = {"data": [input_data]}
         data["validation_data"] = [output_data]
 
         # Only write to a file if there isn't validation file for the model
         PerfAnalyzerScenario.generation_mutex_.acquire()
         if not os.path.exists(data_filename):
-            with open(data_filename, 'w') as f:
+            with open(data_filename, "w") as f:
                 json.dump(data, f)
         PerfAnalyzerScenario.generation_mutex_.release()
 
@@ -310,43 +330,26 @@ def generate_identity_data(self, trial, dtype, data_filename):
             elif dtype == np.dtype(object):
                 res = str(i)
             else:
-                raise Exception(
-                    "unexpected identity data type {}".format(dtype))
+                raise Exception("unexpected identity data type {}".format(dtype))
             io_data.append(res)
         data = {
-            "data": [{
-                input0: {
-                    "content": io_data,
-                    "shape": [16]
-                }
-            }],
-            "validation_data": [{
-                output0: {
-                    "content": io_data,
-                    "shape": [16]
-                }
-            }]
+            "data": [{input0: {"content": io_data, "shape": [16]}}],
+            "validation_data": [{output0: {"content": io_data, "shape": [16]}}],
         }
         # Only write to a file if there isn't validation file for the model
         PerfAnalyzerScenario.generation_mutex_.acquire()
         if not os.path.exists(data_filename):
-            with open(data_filename, 'w') as f:
+            with open(data_filename, "w") as f:
                 json.dump(data, f)
         PerfAnalyzerScenario.generation_mutex_.release()
 
     def run(self, client_metadata):
         model_option = np.random.choice(self.options_)
-        return model_option.run(self.name_, self.sequence_id_range_,
-                                self.out_stream_)
+        return model_option.run(self.name_, self.sequence_id_range_, self.out_stream_)
 
 
 class ResNetScenario(Scenario):
-
-    def __init__(self,
-                 name,
-                 batch_size=32,
-                 verbose=False,
-                 out_stream=sys.stdout):
+    def __init__(self, name, batch_size=32, verbose=False, out_stream=sys.stdout):
         super().__init__(name, [], verbose, out_stream)
         self.model_name_ = "resnet_v1_50_graphdef_def"
         self.batch_size_ = batch_size
@@ -359,7 +362,7 @@ def __init__(self,
 
     def preprocess(self, filename):
         img = Image.open(filename)
-        resized_img = img.convert('RGB').resize((224, 224), Image.BILINEAR)
+        resized_img = img.convert("RGB").resize((224, 224), Image.BILINEAR)
         np_img = np.array(resized_img).astype(np.float32)
         if np_img.ndim == 2:
             np_img = np_img[:, :, np.newaxis]
@@ -369,31 +372,35 @@ def preprocess(self, filename):
     def postprocess(self, results):
         output_array = results.as_numpy("resnet_v1_50/predictions/Softmax")
         if len(output_array) != self.batch_size_:
-            raise Exception("expected {} results, got {}".format(
-                self.batch_size_, len(output_array)))
+            raise Exception(
+                "expected {} results, got {}".format(
+                    self.batch_size_, len(output_array)
+                )
+            )
 
         for results in output_array:
             for result in results:
                 if output_array.dtype.type == np.object_:
-                    cls = "".join(chr(x) for x in result).split(':')
+                    cls = "".join(chr(x) for x in result).split(":")
                 else:
-                    cls = result.split(':')
+                    cls = result.split(":")
                 if cls[2] != "VULTURE":
                     raise Exception(
-                        "expected VULTURE as classification result, got {}".
-                        format(cls[2]))
+                        "expected VULTURE as classification result, got {}".format(
+                            cls[2]
+                        )
+                    )
 
     def run(self, client_metadata):
         triton_client = client_metadata[0]
 
-        inputs = [
-            grpcclient.InferInput("input", self.image_data_.shape, "FP32")
-        ]
+        inputs = [grpcclient.InferInput("input", self.image_data_.shape, "FP32")]
         inputs[0].set_data_from_numpy(self.image_data_)
 
         outputs = [
-            grpcclient.InferRequestedOutput("resnet_v1_50/predictions/Softmax",
-                                            class_count=1)
+            grpcclient.InferRequestedOutput(
+                "resnet_v1_50/predictions/Softmax", class_count=1
+            )
         ]
         res = triton_client.infer(self.model_name_, inputs, outputs=outputs)
         self.postprocess(res)
@@ -401,14 +408,15 @@ def run(self, client_metadata):
 
 
 class TimeoutScenario(Scenario):
-
-    def __init__(self,
-                 name,
-                 trials,
-                 input_dtype=np.float32,
-                 input_name="INPUT0",
-                 verbose=False,
-                 out_stream=sys.stdout):
+    def __init__(
+        self,
+        name,
+        trials,
+        input_dtype=np.float32,
+        input_name="INPUT0",
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
         super().__init__(name, trials, verbose, out_stream)
         self.input_dtype_ = input_dtype
         self.input_name_ = input_name
@@ -421,12 +429,16 @@ def run(self, client_metadata):
         if "librotch" in trial:
             input_name = "INPUT__0"
 
-        tensor_shape = (math.trunc(1 * (1024 * 1024 * 1024) //
-                                   np.dtype(self.input_dtype_).itemsize),)
+        tensor_shape = (
+            math.trunc(
+                1 * (1024 * 1024 * 1024) // np.dtype(self.input_dtype_).itemsize
+            ),
+        )
         in0 = np.random.random(tensor_shape).astype(self.input_dtype_)
         inputs = [
-            grpcclient.InferInput(input_name, tensor_shape,
-                                  np_to_triton_dtype(self.input_dtype_)),
+            grpcclient.InferInput(
+                input_name, tensor_shape, np_to_triton_dtype(self.input_dtype_)
+            ),
         ]
         inputs[0].set_data_from_numpy(in0)
 
@@ -442,12 +454,11 @@ def run(self, client_metadata):
 
 
 class CrashingScenario(Scenario):
-
     def __init__(self, name, verbose=False, out_stream=sys.stdout):
         super().__init__(name, [], verbose, out_stream)
 
     def run(self, client_metadata):
-        # Only use "custom" model as it simulates exectuion delay which
+        # Only use "custom" model as it simulates execution delay which
         # simplifies "crashing simulation" (client exits while request is being
         # executed)
         trial = "custom"
@@ -455,8 +466,7 @@ def run(self, client_metadata):
         # Call the client as subprocess to avoid crashing stress test
         # and gather logging as string variable
         crashing_client = "crashing_client.py"
-        log = subprocess.check_output(
-            [sys.executable, crashing_client, "-t", trial])
+        log = subprocess.check_output([sys.executable, crashing_client, "-t", trial])
         result = self.parse_result(log.decode("utf-8"))
         if not result[1]:
             assert False, "crashing_client failed {}".format(self.name_)
@@ -471,22 +481,20 @@ def parse_result(self, log):
         if "request_count:" in log:
             idx_start = log.rindex("request_count:")
             idx_start = log.find(" ", idx_start)
-            idx_end = log.find('\n', idx_start)
-            request_count = int(log[idx_start + 1:idx_end])
+            idx_end = log.find("\n", idx_start)
+            request_count = int(log[idx_start + 1 : idx_end])
 
         if "live:" in log:
             idx_start = log.rindex("live:")
             idx_start = log.find(" ", idx_start)
-            idx_end = log.find('\n', idx_start)
-            is_server_live = log[idx_start + 1:idx_end]
+            idx_end = log.find("\n", idx_start)
+            is_server_live = log[idx_start + 1 : idx_end]
 
         return (request_count, is_server_live == "true")
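
As a rough illustration of what parse_result() expects, here is a self-contained sketch with a made-up log snippet; the exact output format of crashing_client.py is an assumption here, inferred only from the parsing logic above:

    # Made-up log text shaped like what the parser above looks for.
    sample_log = "client exiting...\nrequest_count: 42\nlive: true\n"

    idx_start = sample_log.rindex("request_count:")
    idx_start = sample_log.find(" ", idx_start)
    idx_end = sample_log.find("\n", idx_start)
    assert int(sample_log[idx_start + 1 : idx_end]) == 42

    idx_start = sample_log.rindex("live:")
    idx_start = sample_log.find(" ", idx_start)
    idx_end = sample_log.find("\n", idx_start)
    assert sample_log[idx_start + 1 : idx_end] == "true"
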
 
 
 class SequenceScenario(Scenario):
-
     class UserData:
-
         def __init__(self):
             self._completed_requests = queue.Queue()
 
@@ -497,51 +505,63 @@ def __init__(self):
     def check_constraints(self, model_name, sequence_id):
         pass
 
-    def __init__(self,
-                 name,
-                 trials,
-                 rng,
-                 sequence_constraints,
-                 verbose=False,
-                 out_stream=sys.stdout):
+    def __init__(
+        self,
+        name,
+        trials,
+        rng,
+        sequence_constraints,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
         super().__init__(name, trials, verbose, out_stream)
         self.rng_ = rng
         self.sequence_constraints_ = sequence_constraints
 
     def get_expected_result(self, expected_result, value, trial, flag_str=None):
         # Adjust the expected_result for models that
-        # couldn't implement the full accumulator. See
+        # could not implement the full accumulator. See
         # qa/common/gen_qa_sequence_models.py for more
         # information.
-        if (("nobatch" not in trial and
-             ("custom" not in trial)) or ("graphdef" in trial) or
-            ("plan" in trial) or ("onnx" in trial)) or ("libtorch" in trial):
+        if (
+            ("nobatch" not in trial and ("custom" not in trial))
+            or ("graphdef" in trial)
+            or ("plan" in trial)
+            or ("onnx" in trial)
+        ) or ("libtorch" in trial):
             expected_result = value
             if (flag_str is not None) and ("start" in flag_str):
                 expected_result += 1
         return expected_result
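
To make the branch above concrete, here is a small, self-contained trace of how the expected value evolves over a three-step sequence; the trial names and values are illustrative only, and the helper simply mirrors the run()/get_expected_result() interplay shown in this file:

    def expected_for(trial, values):
        # Mirrors: run() accumulates the value, then get_expected_result()
        # overrides the accumulated sum for trials matching the condition above.
        expected, out = 0, []
        for idx, val in enumerate(values):
            flag_str = "start" if idx == 0 else None
            expected += val
            if (
                ("nobatch" not in trial and ("custom" not in trial))
                or ("graphdef" in trial)
                or ("plan" in trial)
                or ("onnx" in trial)
            ) or ("libtorch" in trial):
                expected = val
                if (flag_str is not None) and ("start" in flag_str):
                    expected += 1
            out.append(expected)
        return out

    # Illustrative trials: one that matches the condition, one that does not.
    assert expected_for("onnx", [3, 5, 7]) == [4, 5, 7]
    assert expected_for("custom_nobatch", [3, 5, 7]) == [3, 8, 15]
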
 
-    def check_sequence_async(self,
-                             client_metadata,
-                             trial,
-                             model_name,
-                             input_dtype,
-                             steps,
-                             timeout_ms=DEFAULT_TIMEOUT_MS,
-                             batch_size=1,
-                             sequence_name="",
-                             tensor_shape=(1,),
-                             input_name="INPUT",
-                             output_name="OUTPUT"):
+    def check_sequence_async(
+        self,
+        client_metadata,
+        trial,
+        model_name,
+        input_dtype,
+        steps,
+        timeout_ms=DEFAULT_TIMEOUT_MS,
+        batch_size=1,
+        sequence_name="",
+        tensor_shape=(1,),
+        input_name="INPUT",
+        output_name="OUTPUT",
+    ):
         """Perform sequence of inferences using async run. The 'steps' holds
         a list of tuples, one for each inference with format:
 
         (flag_str, value, expected_result, delay_ms)
 
         """
-        if (("savedmodel" not in trial) and ("graphdef" not in trial) and
-            ("custom" not in trial) and ("onnx" not in trial) and
-            ("libtorch" not in trial) and ("plan" not in trial)):
+        if (
+            ("savedmodel" not in trial)
+            and ("graphdef" not in trial)
+            and ("custom" not in trial)
+            and ("onnx" not in trial)
+            and ("libtorch" not in trial)
+            and ("plan" not in trial)
+        ):
             assert False, "unknown trial type: " + trial
 
         if "nobatch" not in trial:
@@ -565,28 +585,30 @@ def check_sequence_async(self,
             seq_start = False
             seq_end = False
             if flag_str is not None:
-                seq_start = ("start" in flag_str)
-                seq_end = ("end" in flag_str)
+                seq_start = "start" in flag_str
+                seq_end = "end" in flag_str
 
             if input_dtype == np.object_:
                 in0 = np.full(tensor_shape, value, dtype=np.int32)
-                in0n = np.array([str(x) for x in in0.reshape(in0.size)],
-                                dtype=object)
+                in0n = np.array([str(x) for x in in0.reshape(in0.size)], dtype=object)
                 in0 = in0n.reshape(tensor_shape)
             else:
                 in0 = np.full(tensor_shape, value, dtype=input_dtype)
 
             inputs = [
-                grpcclient.InferInput(input_name, tensor_shape,
-                                      np_to_triton_dtype(input_dtype)),
+                grpcclient.InferInput(
+                    input_name, tensor_shape, np_to_triton_dtype(input_dtype)
+                ),
             ]
             inputs[0].set_data_from_numpy(in0)
 
-            triton_client.async_stream_infer(model_name,
-                                             inputs,
-                                             sequence_id=sequence_id,
-                                             sequence_start=seq_start,
-                                             sequence_end=seq_end)
+            triton_client.async_stream_infer(
+                model_name,
+                inputs,
+                sequence_id=sequence_id,
+                sequence_start=seq_start,
+                sequence_end=seq_end,
+            )
             sent_count += 1
 
             if delay_ms is not None:
@@ -607,49 +629,62 @@ def check_sequence_async(self,
                 if (now_ms - seq_start_ms) > timeout_ms:
                     raise TimeoutException(
                         "Timeout expired for {}, got {} ms".format(
-                            sequence_name, (now_ms - seq_start_ms)))
-
-            result = results.as_numpy(
-                output_name)[0] if "nobatch" in trial else results.as_numpy(
-                    output_name)[0][0]
+                            sequence_name, (now_ms - seq_start_ms)
+                        )
+                    )
+
+            result = (
+                results.as_numpy(output_name)[0]
+                if "nobatch" in trial
+                else results.as_numpy(output_name)[0][0]
+            )
             if self.verbose_:
-                print("{} {}: + {} = {}".format(sequence_name, sequence_id,
-                                                value, result),
-                      file=self.out_stream_)
+                print(
+                    "{} {}: + {} = {}".format(
+                        sequence_name, sequence_id, value, result
+                    ),
+                    file=self.out_stream_,
+                )
 
             if expected is not None:
                 if input_dtype == np.object_:
-                    assert int(
-                        result
-                    ) == expected, "{}: expected result {}, got {} {} {}".format(
-                        sequence_name, expected, int(result), trial, model_name)
+                    assert (
+                        int(result) == expected
+                    ), "{}: expected result {}, got {} {} {}".format(
+                        sequence_name, expected, int(result), trial, model_name
+                    )
                 else:
-                    assert result == expected, "{}: expected result {}, got {} {} {}".format(
-                        sequence_name, expected, result, trial, model_name)
+                    assert (
+                        result == expected
+                    ), "{}: expected result {}, got {} {} {}".format(
+                        sequence_name, expected, result, trial, model_name
+                    )
         triton_client.stop_stream()
         return sent_count
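
A minimal, hypothetical wiring of check_sequence_async() outside the scenario classes, to show the shape of the 'steps' tuples; it assumes a running Triton server with a matching sequence model on localhost:8001, and the model name and correlation ID below are placeholders:

    import numpy as np
    import tritonclient.grpc as grpcclient
    from scenarios import SequenceValidScenario

    client = grpcclient.InferenceServerClient("localhost:8001")
    scenario = SequenceValidScenario("example", ("onnx",), np.random.RandomState(0), {})

    steps = [
        # (flag_str, value, expected_result, delay_ms); a None expected_result
        # skips the value check inside check_sequence_async().
        ("start", 1, None, None),
        (None, 2, None, None),
        ("end", 4, None, None),
    ]
    sent = scenario.check_sequence_async(
        (client, 1001),         # client_metadata: (client, correlation id)
        "onnx",                 # trial type; must name one of the known backends
        "onnx_sequence_int32",  # placeholder model name
        np.int32,
        steps,
        sequence_name=scenario.scenario_name(),
    )
    # sent holds the number of requests issued (3 here).
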
 
 
 class SequenceNoEndScenario(SequenceScenario):
-
-    def __init__(self,
-                 name,
-                 trials,
-                 rng,
-                 sequence_constraints,
-                 verbose=False,
-                 out_stream=sys.stdout):
-        super().__init__(name, trials, rng, sequence_constraints, verbose,
-                         out_stream)
+    def __init__(
+        self,
+        name,
+        trials,
+        rng,
+        sequence_constraints,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
+        super().__init__(name, trials, rng, sequence_constraints, verbose, out_stream)
 
     def check_constraints(self, model_name, sequence_id):
         # The scenario can always be run regardless of the previous runs
         return True
 
-    def run(self,
-            client_metadata,
-            len_mean=SEQUENCE_LENGTH_MEAN,
-            len_stddev=SEQUENCE_LENGTH_STDEV):
+    def run(
+        self,
+        client_metadata,
+        len_mean=SEQUENCE_LENGTH_MEAN,
+        len_stddev=SEQUENCE_LENGTH_STDEV,
+    ):
         trial = self.get_trial()
         dtype = self.get_datatype(trial)
         model_name = tu.get_sequence_model_name(trial, dtype)
@@ -665,9 +700,10 @@ def run(self,
         # never ends. The sequence should be aborted by the server and its
         # slot reused for another sequence.
         seqlen = max(1, int(self.rng_.normal(len_mean, len_stddev)))
-        print("{} {}: no-end seqlen = {}".format(self.name_, client_metadata[1],
-                                                 seqlen),
-              file=self.out_stream_)
+        print(
+            "{} {}: no-end seqlen = {}".format(self.name_, client_metadata[1], seqlen),
+            file=self.out_stream_,
+        )
 
         values = self.rng_.randint(0, 1024 * 1024, size=seqlen).astype(dtype)
 
@@ -682,40 +718,42 @@ def run(self,
             val = values[idx]
             delay_ms = None
             expected_result += val
-            expected_result = self.get_expected_result(expected_result, val,
-                                                       trial, flags)
+            expected_result = self.get_expected_result(
+                expected_result, val, trial, flags
+            )
 
             # (flag_str, value, expected_result, delay_ms)
-            steps.append((flags, val, expected_result, delay_ms),)
+            steps.append(
+                (flags, val, expected_result, delay_ms),
+            )
 
-        return self.check_sequence_async(client_metadata,
-                                         trial,
-                                         model_name,
-                                         dtype,
-                                         steps,
-                                         sequence_name=self.name_)
+        return self.check_sequence_async(
+            client_metadata, trial, model_name, dtype, steps, sequence_name=self.name_
+        )
 
 
 class SequenceValidNoEndScenario(SequenceScenario):
-
-    def __init__(self,
-                 name,
-                 trials,
-                 rng,
-                 sequence_constraints,
-                 verbose=False,
-                 out_stream=sys.stdout):
-        super().__init__(name, trials, rng, sequence_constraints, verbose,
-                         out_stream)
+    def __init__(
+        self,
+        name,
+        trials,
+        rng,
+        sequence_constraints,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
+        super().__init__(name, trials, rng, sequence_constraints, verbose, out_stream)
 
     def check_constraints(self, model_name, sequence_id):
         # The scenario can always be run regardless of the previous runs
         return True
 
-    def run(self,
-            client_metadata,
-            len_mean=SEQUENCE_LENGTH_MEAN,
-            len_stddev=SEQUENCE_LENGTH_STDEV):
+    def run(
+        self,
+        client_metadata,
+        len_mean=SEQUENCE_LENGTH_MEAN,
+        len_stddev=SEQUENCE_LENGTH_STDEV,
+    ):
         trial = self.get_trial()
         dtype = self.get_datatype(trial)
         model_name = tu.get_sequence_model_name(trial, dtype)
@@ -732,15 +770,18 @@ def run(self,
         # sequences use the same correlation ID and are sent back-to-back.
         seqlen = [
             max(1, int(self.rng_.normal(len_mean, len_stddev))),
-            max(1, int(self.rng_.normal(len_mean, len_stddev)))
+            max(1, int(self.rng_.normal(len_mean, len_stddev))),
         ]
-        print("{} {}: valid-no-end seqlen[0] = {}, seqlen[1] = {}".format(
-            self.name_, client_metadata[1], seqlen[0], seqlen[1]),
-              file=self.out_stream_)
+        print(
+            "{} {}: valid-no-end seqlen[0] = {}, seqlen[1] = {}".format(
+                self.name_, client_metadata[1], seqlen[0], seqlen[1]
+            ),
+            file=self.out_stream_,
+        )
 
         values = [
             self.rng_.randint(0, 1024 * 1024, size=seqlen[0]).astype(dtype),
-            self.rng_.randint(0, 1024 * 1024, size=seqlen[1]).astype(dtype)
+            self.rng_.randint(0, 1024 * 1024, size=seqlen[1]).astype(dtype),
         ]
 
         for p in [0, 1]:
@@ -758,39 +799,41 @@ def run(self,
                 delay_ms = None
                 expected_result += val
                 expected_result = self.get_expected_result(
-                    expected_result, val, trial, flags)
+                    expected_result, val, trial, flags
+                )
 
                 # (flag_str, value, expected_result, delay_ms)
-                steps.append((flags, val, expected_result, delay_ms),)
+                steps.append(
+                    (flags, val, expected_result, delay_ms),
+                )
 
-        return self.check_sequence_async(client_metadata,
-                                         trial,
-                                         model_name,
-                                         dtype,
-                                         steps,
-                                         sequence_name=self.name_)
+        return self.check_sequence_async(
+            client_metadata, trial, model_name, dtype, steps, sequence_name=self.name_
+        )
 
 
 class SequenceValidValidScenario(SequenceScenario):
-
-    def __init__(self,
-                 name,
-                 trials,
-                 rng,
-                 sequence_constraints,
-                 verbose=False,
-                 out_stream=sys.stdout):
-        super().__init__(name, trials, rng, sequence_constraints, verbose,
-                         out_stream)
+    def __init__(
+        self,
+        name,
+        trials,
+        rng,
+        sequence_constraints,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
+        super().__init__(name, trials, rng, sequence_constraints, verbose, out_stream)
 
     def check_constraints(self, model_name, sequence_id):
         # The scenario can always be run regardless of the previous runs
         return True
 
-    def run(self,
-            client_metadata,
-            len_mean=SEQUENCE_LENGTH_MEAN,
-            len_stddev=SEQUENCE_LENGTH_STDEV):
+    def run(
+        self,
+        client_metadata,
+        len_mean=SEQUENCE_LENGTH_MEAN,
+        len_stddev=SEQUENCE_LENGTH_STDEV,
+    ):
         trial = self.get_trial()
         dtype = self.get_datatype(trial)
         model_name = tu.get_sequence_model_name(trial, dtype)
@@ -807,15 +850,18 @@ def run(self,
         # sent back-to-back.
         seqlen = [
             max(1, int(self.rng_.normal(len_mean, len_stddev))),
-            max(1, int(self.rng_.normal(len_mean, len_stddev)))
+            max(1, int(self.rng_.normal(len_mean, len_stddev))),
         ]
-        print("{} {}: valid-valid seqlen[0] = {}, seqlen[1] = {}".format(
-            self.name_, client_metadata[1], seqlen[0], seqlen[1]),
-              file=self.out_stream_)
+        print(
+            "{} {}: valid-valid seqlen[0] = {}, seqlen[1] = {}".format(
+                self.name_, client_metadata[1], seqlen[0], seqlen[1]
+            ),
+            file=self.out_stream_,
+        )
 
         values = [
             self.rng_.randint(0, 1024 * 1024, size=seqlen[0]).astype(dtype),
-            self.rng_.randint(0, 1024 * 1024, size=seqlen[1]).astype(dtype)
+            self.rng_.randint(0, 1024 * 1024, size=seqlen[1]).astype(dtype),
         ]
 
         for p in [0, 1]:
@@ -833,30 +879,30 @@ def run(self,
                 delay_ms = None
                 expected_result += val
                 expected_result = self.get_expected_result(
-                    expected_result, val, trial, flags)
+                    expected_result, val, trial, flags
+                )
 
                 # (flag_str, value, expected_result, delay_ms)
-                steps.append((flags, val, expected_result, delay_ms),)
+                steps.append(
+                    (flags, val, expected_result, delay_ms),
+                )
 
-        return self.check_sequence_async(client_metadata,
-                                         trial,
-                                         model_name,
-                                         dtype,
-                                         steps,
-                                         sequence_name=self.name_)
+        return self.check_sequence_async(
+            client_metadata, trial, model_name, dtype, steps, sequence_name=self.name_
+        )
 
 
 class SequenceNoStartScenario(SequenceScenario):
-
-    def __init__(self,
-                 name,
-                 trials,
-                 rng,
-                 sequence_constraints,
-                 verbose=False,
-                 out_stream=sys.stdout):
-        super().__init__(name, trials, rng, sequence_constraints, verbose,
-                         out_stream)
+    def __init__(
+        self,
+        name,
+        trials,
+        rng,
+        sequence_constraints,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
+        super().__init__(name, trials, rng, sequence_constraints, verbose, out_stream)
 
     def check_constraints(self, model_name, sequence_id):
         # no-start cannot follow no-end since the server will
@@ -864,7 +910,8 @@ def check_constraints(self, model_name, sequence_id):
         # the no-end sequence instead of being a sequence
         # missing start flag.
         if (model_name in self.sequence_constraints_) and (
-                sequence_id in self.sequence_constraints_[model_name]):
+            sequence_id in self.sequence_constraints_[model_name]
+        ):
             return not self.sequence_constraints_[model_name][sequence_id]
         return True
 
@@ -883,9 +930,12 @@ def run(self, client_metadata):
         # Create a sequence without a "start" flag. Sequence should get an
         # error from the server.
         seqlen = 1
-        print("{} {}: no-start seqlen = {}".format(self.name_,
-                                                   client_metadata[1], seqlen),
-              file=self.out_stream_)
+        print(
+            "{} {}: no-start seqlen = {}".format(
+                self.name_, client_metadata[1], seqlen
+            ),
+            file=self.out_stream_,
+        )
 
         values = self.rng_.randint(0, 1024 * 1024, size=seqlen).astype(dtype)
 
@@ -897,11 +947,12 @@ def run(self, client_metadata):
             delay_ms = None
 
             # (flag_str, value, expected_result, delay_ms)
-            steps.append((flags, val, None, delay_ms),)
+            steps.append(
+                (flags, val, None, delay_ms),
+            )
 
         try:
-            self.check_sequence_async(client_metadata, trial, model_name, dtype,
-                                      steps)
+            self.check_sequence_async(client_metadata, trial, model_name, dtype, steps)
             # Reaching this point means a no-start sequence was sent to a
             # sequence id that was used for a no-end sequence, which means the
             # constraints check above is inaccurate
@@ -914,25 +965,27 @@ def run(self, client_metadata):
 
 
 class SequenceValidScenario(SequenceScenario):
-
-    def __init__(self,
-                 name,
-                 trials,
-                 rng,
-                 sequence_constraints,
-                 verbose=False,
-                 out_stream=sys.stdout):
-        super().__init__(name, trials, rng, sequence_constraints, verbose,
-                         out_stream)
+    def __init__(
+        self,
+        name,
+        trials,
+        rng,
+        sequence_constraints,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
+        super().__init__(name, trials, rng, sequence_constraints, verbose, out_stream)
 
     def check_constraints(self, model_name, sequence_id):
         # The scenario can always be run regardless of the previous runs
         return True
 
-    def run(self,
-            client_metadata,
-            len_mean=SEQUENCE_LENGTH_MEAN,
-            len_stddev=SEQUENCE_LENGTH_STDEV):
+    def run(
+        self,
+        client_metadata,
+        len_mean=SEQUENCE_LENGTH_MEAN,
+        len_stddev=SEQUENCE_LENGTH_STDEV,
+    ):
         trial = self.get_trial()
         dtype = self.get_datatype(trial)
         model_name = tu.get_sequence_model_name(trial, dtype)
@@ -946,9 +999,10 @@ def run(self,
 
         # Create a variable length sequence with "start" and "end" flags.
         seqlen = max(1, int(self.rng_.normal(len_mean, len_stddev)))
-        print("{} {}: valid seqlen = {}".format(self.name_, client_metadata[1],
-                                                seqlen),
-              file=self.out_stream_)
+        print(
+            "{} {}: valid seqlen = {}".format(self.name_, client_metadata[1], seqlen),
+            file=self.out_stream_,
+        )
 
         values = self.rng_.randint(0, 1024 * 1024, size=seqlen).astype(dtype)
 
@@ -965,15 +1019,15 @@ def run(self,
             val = values[idx]
             delay_ms = None
             expected_result += val
-            expected_result = self.get_expected_result(expected_result, val,
-                                                       trial, flags)
+            expected_result = self.get_expected_result(
+                expected_result, val, trial, flags
+            )
 
             # (flag_str, value, expected_result, delay_ms)
-            steps.append((flags, val, expected_result, delay_ms),)
-
-        return self.check_sequence_async(client_metadata,
-                                         trial,
-                                         model_name,
-                                         dtype,
-                                         steps,
-                                         sequence_name=self.name_)
+            steps.append(
+                (flags, val, expected_result, delay_ms),
+            )
+
+        return self.check_sequence_async(
+            client_metadata, trial, model_name, dtype, steps, sequence_name=self.name_
+        )
diff --git a/qa/L0_long_running_stress/stress.py b/qa/L0_long_running_stress/stress.py
old mode 100644
new mode 100755
index 0e52a5edbe..978f204ee6
--- a/qa/L0_long_running_stress/stress.py
+++ b/qa/L0_long_running_stress/stress.py
@@ -1,4 +1,6 @@
-# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -27,24 +29,25 @@
 import sys
 
 from scenarios import *
+
 sys.path.append("../common")
 
 import argparse
 import bisect
-from builtins import range
-from builtins import str
 import os
-import time
 import threading
+import time
 import traceback
-import numpy as np
+from builtins import range, str
 from functools import partial
-import tritonclient.grpc as grpcclient
+
+import numpy as np
 import prettytable
+import tritonclient.grpc as grpcclient
 
 FLAGS = None
 CORRELATION_ID_BLOCK_SIZE = 1024 * 1024
-BACKENDS = os.environ.get('BACKENDS', "graphdef savedmodel onnx plan")
+BACKENDS = os.environ.get("BACKENDS", "graphdef savedmodel onnx plan")
 
 _thread_exceptions = []
 _thread_exceptions_mutex = threading.Lock()
@@ -62,24 +65,26 @@
 def get_trials(is_sequence=True):
     _trials = ()
     if is_sequence:
-        for backend in BACKENDS.split(' '):
-            if (backend != "libtorch") and (backend != 'savedmodel'):
+        for backend in BACKENDS.split(" "):
+            if (backend != "libtorch") and (backend != "savedmodel"):
                 _trials += (backend + "_nobatch",)
             _trials += (backend,)
     else:
         _trials = ()
-        for backend in BACKENDS.split(' '):
-            if (backend != "libtorch"):
+        for backend in BACKENDS.split(" "):
+            if backend != "libtorch":
                 _trials += (backend + "_nobatch",)
     return _trials
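
For the default BACKENDS value above, the trial tuples work out as follows; a quick self-contained trace (the helper is duplicated here only so the snippet runs on its own):

    # Quick trace of get_trials() with the default BACKENDS shown above.
    BACKENDS = "graphdef savedmodel onnx plan"

    def get_trials(is_sequence=True):
        _trials = ()
        if is_sequence:
            for backend in BACKENDS.split(" "):
                if (backend != "libtorch") and (backend != "savedmodel"):
                    _trials += (backend + "_nobatch",)
                _trials += (backend,)
        else:
            _trials = ()
            for backend in BACKENDS.split(" "):
                if backend != "libtorch":
                    _trials += (backend + "_nobatch",)
        return _trials

    assert get_trials(True) == (
        "graphdef_nobatch", "graphdef", "savedmodel",
        "onnx_nobatch", "onnx", "plan_nobatch", "plan",
    )
    assert get_trials(False) == (
        "graphdef_nobatch", "savedmodel_nobatch", "onnx_nobatch", "plan_nobatch",
    )
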
 
 
-def update_test_count(test_case_count,
-                      failed_test_case_count,
-                      request_count,
-                      test_case_name,
-                      success=True,
-                      count=1):
+def update_test_count(
+    test_case_count,
+    failed_test_case_count,
+    request_count,
+    test_case_name,
+    success=True,
+    count=1,
+):
     if success:
         # Count the times each test case runs
         if test_case_name in test_case_count:
@@ -101,7 +106,6 @@ def update_test_count(test_case_count,
 
 
 class ScenarioSelector:
-
     def __init__(self, probs, rng):
         self.rng_ = rng
         self.probs_range_ = []
@@ -118,20 +122,24 @@ def __init__(self, probs, rng):
             self.probs_range_[i] /= total_weight
 
     def get_scenario(self):
-        return self.scenarios_[bisect.bisect_left(self.probs_range_,
-                                                  self.rng_.rand())]
+        return self.scenarios_[bisect.bisect_left(self.probs_range_, self.rng_.rand())]
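
The selector turns the integer weights into a cumulative, normalized table and maps a uniform draw onto it with bisect. Below is a standalone sketch of the same idea, with dummy scenario labels; the accumulation loop is an assumed reconstruction, since the middle of __init__ is not shown in this diff:

    # Minimal sketch of weighted selection via cumulative probabilities + bisect.
    import bisect
    import numpy as np

    class DummySelector:
        def __init__(self, probs, rng):
            self.rng_ = rng
            self.probs_range_ = []
            self.scenarios_ = []
            total_weight = 0
            for weight, scenario in probs:
                total_weight += weight
                self.probs_range_.append(total_weight)
                self.scenarios_.append(scenario)
            for i in range(len(self.probs_range_)):
                self.probs_range_[i] /= total_weight

        def get_scenario(self):
            return self.scenarios_[bisect.bisect_left(self.probs_range_, self.rng_.rand())]

    rng = np.random.RandomState(0)
    sel = DummySelector([(60, "timeout"), (300, "perf_analyzer")], rng)
    picks = [sel.get_scenario() for _ in range(10000)]
    # "perf_analyzer" should be drawn roughly 300/360 ~ 83% of the time.
    print(picks.count("perf_analyzer") / len(picks))
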
 
 
-def stress_thread(name, seed, correlation_id_base, test_case_count,
-                  failed_test_case_count, sequence_request_count):
+def stress_thread(
+    name,
+    seed,
+    correlation_id_base,
+    test_case_count,
+    failed_test_case_count,
+    sequence_request_count,
+):
     # Thread responsible for generating sequences of inference
     # requests.
     global _thread_exceptions
 
     # Write any thread output to dedicated file
-    with open("{}.log".format(name), 'w') as out_file:
-        print("Starting thread {} with seed {}".format(name, seed),
-              file=out_file)
+    with open("{}.log".format(name), "w") as out_file:
+        print("Starting thread {} with seed {}".format(name, seed), file=out_file)
         rng = np.random.RandomState(seed)
 
         # FIXME revisit to check if it is necessary
@@ -150,74 +158,111 @@ def stress_thread(name, seed, correlation_id_base, test_case_count,
         rare_cnt = 8
         is_last_used_no_end = {}
 
-        update_counter_fn = partial(update_test_count, test_case_count,
-                                    failed_test_case_count,
-                                    sequence_request_count)
+        update_counter_fn = partial(
+            update_test_count,
+            test_case_count,
+            failed_test_case_count,
+            sequence_request_count,
+        )
         for c in range(common_cnt + rare_cnt):
             client_metadata_list.append(
-                (grpcclient.InferenceServerClient("localhost:8001",
-                                                  verbose=FLAGS.verbose),
-                 correlation_id_base + c))
+                (
+                    grpcclient.InferenceServerClient(
+                        "localhost:8001", verbose=FLAGS.verbose
+                    ),
+                    correlation_id_base + c,
+                )
+            )
         pa_start_seq_id = correlation_id_base + common_cnt + rare_cnt
         pa_end_seq_id = correlation_id_base + CORRELATION_ID_BLOCK_SIZE
 
         # Weight roughly in thousandth percent
-        ss = ScenarioSelector([
-            (60,
-             TimeoutScenario(name,
-                             get_trials(False),
-                             verbose=FLAGS.verbose,
-                             out_stream=out_file)),
-            (80, ResNetScenario(
-                name, verbose=FLAGS.verbose, out_stream=out_file)),
-            (60,
-             CrashingScenario(name, verbose=FLAGS.verbose,
-                              out_stream=out_file)),
-            (62,
-             SequenceNoEndScenario(name,
-                                   get_trials(),
-                                   rng,
-                                   is_last_used_no_end,
-                                   verbose=FLAGS.verbose,
-                                   out_stream=out_file)),
-            (68,
-             SequenceValidNoEndScenario(name,
-                                        get_trials(),
-                                        rng,
-                                        is_last_used_no_end,
-                                        verbose=FLAGS.verbose,
-                                        out_stream=out_file)),
-            (68,
-             SequenceValidValidScenario(name,
-                                        get_trials(),
-                                        rng,
-                                        is_last_used_no_end,
-                                        verbose=FLAGS.verbose,
-                                        out_stream=out_file)),
-            (7,
-             SequenceNoStartScenario(name,
-                                     get_trials(),
-                                     rng,
-                                     is_last_used_no_end,
-                                     verbose=FLAGS.verbose,
-                                     out_stream=out_file)),
-            (295,
-             SequenceValidScenario(name,
-                                   get_trials(),
-                                   rng,
-                                   is_last_used_no_end,
-                                   verbose=FLAGS.verbose,
-                                   out_stream=out_file)),
-            (300,
-             PerfAnalyzerScenario(
-                 name,
-                 rng,
-                 get_trials(),
-                 get_trials(False),
-                 sequence_id_range=(pa_start_seq_id, pa_end_seq_id),
-                 verbose=FLAGS.verbose,
-                 out_stream=out_file)),
-        ], rng)
+        ss = ScenarioSelector(
+            [
+                (
+                    60,
+                    TimeoutScenario(
+                        name,
+                        get_trials(False),
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+                (80, ResNetScenario(name, verbose=FLAGS.verbose, out_stream=out_file)),
+                (
+                    60,
+                    CrashingScenario(name, verbose=FLAGS.verbose, out_stream=out_file),
+                ),
+                (
+                    62,
+                    SequenceNoEndScenario(
+                        name,
+                        get_trials(),
+                        rng,
+                        is_last_used_no_end,
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+                (
+                    68,
+                    SequenceValidNoEndScenario(
+                        name,
+                        get_trials(),
+                        rng,
+                        is_last_used_no_end,
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+                (
+                    68,
+                    SequenceValidValidScenario(
+                        name,
+                        get_trials(),
+                        rng,
+                        is_last_used_no_end,
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+                (
+                    7,
+                    SequenceNoStartScenario(
+                        name,
+                        get_trials(),
+                        rng,
+                        is_last_used_no_end,
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+                (
+                    295,
+                    SequenceValidScenario(
+                        name,
+                        get_trials(),
+                        rng,
+                        is_last_used_no_end,
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+                (
+                    300,
+                    PerfAnalyzerScenario(
+                        name,
+                        rng,
+                        get_trials(),
+                        get_trials(False),
+                        sequence_id_range=(pa_start_seq_id, pa_end_seq_id),
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+            ],
+            rng,
+        )
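
The weights above ("roughly in thousandth percent") do sum to 1000, so each entry reads directly as a per-mille selection probability, e.g. ~29.5% for SequenceValidScenario and ~30% for PerfAnalyzerScenario; a quick check:

    # Weights as passed to ScenarioSelector above.
    weights = [60, 80, 60, 62, 68, 68, 7, 295, 300]
    assert sum(weights) == 1000
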
 
         rare_idx = 0
         common_idx = 0
@@ -240,8 +285,9 @@ def stress_thread(name, seed, correlation_id_base, test_case_count,
                 update_counter_fn(scenario.scenario_name(), False)
                 _thread_exceptions_mutex.acquire()
                 try:
-                    _thread_exceptions.append((name, scenario.scenario_name(),
-                                               traceback.format_exc()))
+                    _thread_exceptions.append(
+                        (name, scenario.scenario_name(), traceback.format_exc())
+                    )
                 finally:
                     _thread_exceptions_mutex.release()
 
@@ -255,6 +301,72 @@ def stress_thread(name, seed, correlation_id_base, test_case_count,
         print("Exiting thread {}".format(name), file=out_file)
 
 
+def load_thread(
+    name,
+    seed,
+    correlation_id_base,
+    test_case_count,
+    failed_test_case_count,
+    sequence_request_count,
+):
+    # Thread responsible for generating a steady load of inference
+    # requests to keep the compute devices busy.
+    global _thread_exceptions
+
+    # Write any thread output to dedicated file
+    with open("{}.log".format(name), "w") as out_file:
+        print("Starting thread {} with seed {}".format(name, seed), file=out_file)
+        rng = np.random.RandomState(seed)
+
+        update_counter_fn = partial(
+            update_test_count,
+            test_case_count,
+            failed_test_case_count,
+            sequence_request_count,
+        )
+        pa_start_seq_id = correlation_id_base
+        pa_end_seq_id = correlation_id_base + CORRELATION_ID_BLOCK_SIZE
+
+        # Create PerfAnalyzerScenario with no additional trials so that the
+        # default 'resnet' model, which is more compute-intensive than the
+        # simple models, is the only choice for generating load
+        ss = ScenarioSelector(
+            [
+                (
+                    1,
+                    PerfAnalyzerScenario(
+                        name,
+                        rng,
+                        [],
+                        [],
+                        sequence_id_range=(pa_start_seq_id, pa_end_seq_id),
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+            ],
+            rng,
+        )
+
+        while not STOP_STRESS_THREAD:
+            scenario = ss.get_scenario()
+            try:
+                res = scenario.run(None)
+                if res is not None:
+                    update_counter_fn(scenario.scenario_name(), count=res)
+            except Exception as ex:
+                update_counter_fn(scenario.scenario_name(), False)
+                _thread_exceptions_mutex.acquire()
+                try:
+                    _thread_exceptions.append(
+                        (name, scenario.scenario_name(), traceback.format_exc())
+                    )
+                finally:
+                    _thread_exceptions_mutex.release()
+
+        print("Exiting thread {}".format(name), file=out_file)
+
+
 def format_content(content, max_line_length):
     # Accumulated line length
     ACC_length = 0
@@ -283,47 +395,45 @@ def accumulate_count(dict_list, test_case_name):
     return count
 
 
-def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
-                    _sequence_request_count):
+def generate_report(
+    elapsed_time, _test_case_count, _failed_test_case_count, _sequence_request_count
+):
     hrs = elapsed_time // 3600
     mins = (elapsed_time / 60) % 60
     secs = elapsed_time % 60
 
     test_case_description = {
-        'SequenceValidScenario':
-            'Send a sequence with "start" and "end" flags.',
-        'SequenceValidValidScenario':
-            'Send two sequences back to back using the same correlation ID'
-            ' with "start" and "end" flags.',
-        'SequenceValidNoEndScenario':
-            'Send two sequences back to back using the same correlation ID.'
-            ' The first with "start" and "end" flags, and the second with no'
-            ' "end" flag.',
-        'SequenceNoStartScenario':
-            'Send a sequence without a "start" flag. Sequence should get an'
-            ' error from the server.',
-        'SequenceNoEndScenario':
-            'Send a sequence with "start" flag but that never ends. The'
-            ' sequence should be aborted by the server and its slot reused'
-            ' for another sequence.',
-        'TimeoutScenario':
-            'Expect an exception for small timeout values.',
-        'ResNetScenario':
-            'Send a request using resnet model.',
-        'CrashingScenario':
-            'Client crashes in the middle of inferences.',
-        'PerfAnalyzerScenario':
-            'Client that maintains a specific load.',
+        "SequenceValidScenario": 'Send a sequence with "start" and "end" flags.',
+        "SequenceValidValidScenario": "Send two sequences back to back using the same correlation ID"
+        ' with "start" and "end" flags.',
+        "SequenceValidNoEndScenario": "Send two sequences back to back using the same correlation ID."
+        ' The first with "start" and "end" flags, and the second with no'
+        ' "end" flag.',
+        "SequenceNoStartScenario": 'Send a sequence without a "start" flag. Sequence should get an'
+        " error from the server.",
+        "SequenceNoEndScenario": 'Send a sequence with "start" flag but that never ends. The'
+        " sequence should be aborted by the server and its slot reused"
+        " for another sequence.",
+        "TimeoutScenario": "Expect an exception for small timeout values.",
+        "ResNetScenario": "Send a request using resnet model.",
+        "CrashingScenario": "Client crashes in the middle of inferences.",
+        "PerfAnalyzerScenario": "Client that maintains a specific load.",
     }
 
     f = open("stress_report.txt", "w")
-    f.write("Test Duration: {:0>2}:{:0>2}:{:0>2} (HH:MM:SS)\n".format(
-        int(hrs), int(mins), int(secs)))
+    f.write(
+        "Test Duration: {:0>2}:{:0>2}:{:0>2} (HH:MM:SS)\n".format(
+            int(hrs), int(mins), int(secs)
+        )
+    )
 
     t = prettytable.PrettyTable(hrules=prettytable.ALL)
     t.field_names = [
-        'Test Case', 'Number of Failures', 'Test Count', 'Request Count',
-        'Test Case Description'
+        "Test Case",
+        "Number of Failures",
+        "Test Count",
+        "Request Count",
+        "Test Case Description",
     ]
 
     t.align["Test Case"] = "l"
@@ -339,33 +449,38 @@ def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
     for c in test_case_description:
         # Accumulate all the individual thread counts
         acc_test_case_count[c] = accumulate_count(_test_case_count, c)
-        acc_failed_test_case_count[c] = accumulate_count(
-            _failed_test_case_count, c)
-        acc_sequence_request_count[c] = accumulate_count(
-            _sequence_request_count, c)
+        acc_failed_test_case_count[c] = accumulate_count(_failed_test_case_count, c)
+        acc_sequence_request_count[c] = accumulate_count(_sequence_request_count, c)
 
         description = test_case_description[c]
         # Add additional description on scenarios that allow failure
         if c in ALLOW_FAILURE_SCENARIO:
-            description += " Note that this scenario is marked to allow " \
-                           "failure due to subtle edge cases that will be " \
-                           "investigated in the future. However, only a " \
-                           "minimal failure count is expected and we should " \
-                           "take action if the number is concerning."
-        t.add_row([
-            c, acc_failed_test_case_count[c] if c in acc_failed_test_case_count
-            else 0, acc_test_case_count[c] if c in acc_test_case_count else 0,
-            acc_sequence_request_count[c]
-            if c in acc_sequence_request_count else 0,
-            format_content(description, 50)
-        ])
-
-    t.add_row([
-        'TOTAL',
-        sum(acc_failed_test_case_count.values()),
-        sum(acc_test_case_count.values()),
-        sum(acc_sequence_request_count.values()), 'X'
-    ])
+            description += (
+                " Note that this scenario is marked to allow "
+                "failure due to subtle edge cases that will be "
+                "investigated in the future. However, only a "
+                "minimal failure count is expected and we should "
+                "take action if the number is concerning."
+            )
+        t.add_row(
+            [
+                c,
+                acc_failed_test_case_count[c] if c in acc_failed_test_case_count else 0,
+                acc_test_case_count[c] if c in acc_test_case_count else 0,
+                acc_sequence_request_count[c] if c in acc_sequence_request_count else 0,
+                format_content(description, 50),
+            ]
+        )
+
+    t.add_row(
+        [
+            "TOTAL",
+            sum(acc_failed_test_case_count.values()),
+            sum(acc_test_case_count.values()),
+            sum(acc_sequence_request_count.values()),
+            "X",
+        ]
+    )
 
     print(t)
     f.write(str(t))
@@ -373,33 +488,48 @@ def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
     f.close()
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-v',
-                        '--verbose',
-                        action="store_true",
-                        required=False,
-                        default=False,
-                        help='Enable verbose output')
-    parser.add_argument('-r',
-                        '--random-seed',
-                        type=int,
-                        required=False,
-                        help='Random seed.')
-    parser.add_argument('-t',
-                        '--concurrency',
-                        type=int,
-                        required=False,
-                        default=8,
-                        help='Request concurrency. Default is 8.')
     parser.add_argument(
-        '-d',
-        '--test-duration',
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-r", "--random-seed", type=int, required=False, help="Random seed."
+    )
+    parser.add_argument(
+        "-t",
+        "--concurrency",
+        type=int,
+        required=False,
+        default=8,
+        help="Request concurrency. Default is 8.",
+    )
+    parser.add_argument(
+        "--load-thread",
+        type=int,
+        required=False,
+        default=0,
+        help="Number of dedicated threads that keep compute "
+        "device (i.e. GPU/CPUs) under load. The load generated "
+        'from "--concurrency" often behaves as request spike, '
+        " this argument may be used to produce consistent load "
+        " to keep devices at high utilization. Default is 0, "
+        "which means no dedicated load thread will be created.",
+    )
+    parser.add_argument(
+        "-d",
+        "--test-duration",
         type=int,
         required=False,
         default=25000,
-        help='Duration of stress test to run. Default is 25000 seconds ' +
-        '(approximately 7 hours).')
+        help="Duration of stress test to run. Default is 25000 seconds "
+        + "(approximately 7 hours).",
+    )
     FLAGS = parser.parse_args()
 
     # Initialize the random seed. For reproducibility each thread
@@ -416,13 +546,17 @@ def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
     print("test duration = {}".format(FLAGS.test_duration))
 
     # Create hashes for each thread for generating report
-    _test_case_count = [dict() for x in range(FLAGS.concurrency)]
-    _failed_test_case_count = [dict() for x in range(FLAGS.concurrency)]
-    _sequence_request_count = [dict() for x in range(FLAGS.concurrency)]
+    _test_case_count = [dict() for _ in range(FLAGS.concurrency + FLAGS.load_thread)]
+    _failed_test_case_count = [
+        dict() for _ in range(FLAGS.concurrency + FLAGS.load_thread)
+    ]
+    _sequence_request_count = [
+        dict() for _ in range(FLAGS.concurrency + FLAGS.load_thread)
+    ]
 
     threads = []
 
-    for idx, thd in enumerate(range(FLAGS.concurrency)):
+    for idx in range(FLAGS.concurrency):
         thread_name = "thread_{}".format(idx)
 
         # Create the seed for the thread. Since these are created in
@@ -435,11 +569,46 @@ def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
         correlation_id_base = 1 + (idx * CORRELATION_ID_BLOCK_SIZE)
 
         threads.append(
-            threading.Thread(target=stress_thread,
-                             args=(thread_name, seed, correlation_id_base,
-                                   _test_case_count[idx],
-                                   _failed_test_case_count[idx],
-                                   _sequence_request_count[idx])))
+            threading.Thread(
+                target=stress_thread,
+                args=(
+                    thread_name,
+                    seed,
+                    correlation_id_base,
+                    _test_case_count[idx],
+                    _failed_test_case_count[idx],
+                    _sequence_request_count[idx],
+                ),
+            )
+        )
+
+    for idx in range(FLAGS.load_thread):
+        thread_name = "load_thread_{}".format(idx)
+
+        # Create the seed for the thread. Since these are created in
+        # reproducible order off of the initial seed we will get
+        # reproducible results when given the same seed.
+        seed = np.random.randint(2**32)
+
+        # Each thread is reserved a block of correlation IDs or size
+        # CORRELATION_ID_BLOCK_SIZE
+        correlation_id_base = 1 + (
+            (FLAGS.concurrency + idx) * CORRELATION_ID_BLOCK_SIZE
+        )
+
+        threads.append(
+            threading.Thread(
+                target=load_thread,
+                args=(
+                    thread_name,
+                    seed,
+                    correlation_id_base,
+                    _test_case_count[idx],
+                    _failed_test_case_count[idx],
+                    _sequence_request_count[idx],
+                ),
+            )
+        )
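
Each thread, stress or load, is handed a disjoint block of CORRELATION_ID_BLOCK_SIZE correlation IDs; with the default concurrency of 8, the arithmetic works out as below (a small check, not part of the test itself):

    CORRELATION_ID_BLOCK_SIZE = 1024 * 1024
    concurrency = 8  # default of --concurrency

    stress_bases = [1 + i * CORRELATION_ID_BLOCK_SIZE for i in range(concurrency)]
    load_bases = [1 + (concurrency + i) * CORRELATION_ID_BLOCK_SIZE for i in range(2)]
    assert stress_bases[:2] == [1, 1048577]
    assert load_bases[0] == 8388609  # load threads start after all stress-thread blocks
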
 
     exit_code = 0
 
@@ -447,15 +616,13 @@ def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
     for t in threads:
         t.start()
 
-    liveness_count = 0
-    while liveness_count < FLAGS.test_duration:
+    while (time.time() - start_time) < FLAGS.test_duration:
         time.sleep(1)
         for t in threads:
             # Stop the test early if there is early termination of a thread.
             if not t.is_alive():
                 exit_code = 1
                 break
-        liveness_count += 1
         if exit_code != 0:
             break
 
@@ -467,15 +634,18 @@ def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
         if t.is_alive() and (exit_code == 0):
             exit_code = 1
 
-    generate_report(time.time() - start_time, _test_case_count,
-                    _failed_test_case_count, _sequence_request_count)
+    generate_report(
+        time.time() - start_time,
+        _test_case_count,
+        _failed_test_case_count,
+        _sequence_request_count,
+    )
 
     _thread_exceptions_mutex.acquire()
     try:
         if len(_thread_exceptions) > 0:
             for thread, scenario, ex in _thread_exceptions:
-                print("*********\n* {} {}\n{}*********\n".format(
-                    thread, scenario, ex))
+                print("*********\n* {} {}\n{}*********\n".format(thread, scenario, ex))
                 if scenario not in ALLOW_FAILURE_SCENARIO:
                     exit_code = 1
     finally:
diff --git a/qa/L0_long_running_stress/stress_mail.py b/qa/L0_long_running_stress/stress_mail.py
old mode 100644
new mode 100755
index 9f9e1b660e..36f347c2ac
--- a/qa/L0_long_running_stress/stress_mail.py
+++ b/qa/L0_long_running_stress/stress_mail.py
@@ -26,23 +26,37 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
 import os
-import nightly_email_helper
-
 from datetime import date
 
-CI_JOB_ID = os.environ.get('CI_JOB_ID', '')
+import nightly_email_helper
+
+CI_JOB_ID = os.environ.get("CI_JOB_ID", "")
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     today = date.today().strftime("%Y-%m-%d")
-    subject = "Triton Long-Running Stress Test Summary: " + today
+    subject = (
+        "Triton Long-Running Stress Test "
+        + ((sys.argv[1] + " ") if len(sys.argv) >= 2 else "")
+        + "Summary: "
+        + today
+    )
     stress_report = "stress_report.txt"
     link = "https://gitlab-master.nvidia.com/dl/dgx/tritonserver/-/jobs/" + CI_JOB_ID
     write_up = "

The table below includes results from long-running stress test. Please refer to the description of each test case to see what different kinds of inference requests were sent. Request concurrency is set to 8.

" - write_up += "

Please check the CI output webpage for the details of the failures: " + link + "

" - html_content = "
" + write_up + "
"
+    write_up += (
+        "

Please check the CI output webpage for the details of the failures: " + + link + + "

" + ) + html_content = ( + '
'
+        + write_up
+        + '
'
+    )
     with open(stress_report, "r") as f:
         html_content += f.read() + "\n"
     html_content += "
" diff --git a/qa/L0_long_running_stress/test.sh b/qa/L0_long_running_stress/test.sh index 6e0632809c..b98a89f955 100755 --- a/qa/L0_long_running_stress/test.sh +++ b/qa/L0_long_running_stress/test.sh @@ -47,6 +47,19 @@ DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"} SERVER=/opt/tritonserver/bin/tritonserver source ../common/util.sh +# If the test should be run in long and high load setting +if [ "$TRITON_PERF_LONG" == 1 ]; then + # ~ 6.5 days + TEST_DURATION=480000 + LOAD_THREAD_COUNT=2 + EMAIL_SUBJECT="Long" +else + # ~ 7 hours + TEST_DURATION=25000 + LOAD_THREAD_COUNT=0 + EMAIL_SUBJECT="" +fi + RET=0 # If BACKENDS not specified, set to all @@ -57,7 +70,7 @@ export CI_JOB_ID=${CI_JOB_ID} MODEL_DIR=models -rm -fr *.log *.txt *.serverlog models validation_data csv_dir && mkdir models validation_data csv_dir +rm -fr *.log *.txt models validation_data csv_dir && mkdir models validation_data csv_dir # Get the datatype to use based on the backend function get_datatype () { @@ -124,10 +137,8 @@ cp -r $DATADIR/tf_model_store/resnet_v1_50_graphdef $MODEL_DIR/resnet_v1_50_grap sed -i 's/^name: "resnet_v1_50_graphdef"/name: "resnet_v1_50_graphdef_def"/' config.pbtxt && \ echo "optimization { }" >> config.pbtxt) -python -m pip install -U prettytable - SERVER_ARGS="--model-repository=`pwd`/$MODEL_DIR" -SERVER_LOG="./serverlog" +SERVER_LOG="./server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -136,8 +147,9 @@ if [ "$SERVER_PID" == "0" ]; then fi set +e -python $STRESS_TEST >>$CLIENT_LOG 2>&1 +python $STRESS_TEST -d ${TEST_DURATION} --load-thread ${LOAD_THREAD_COUNT} >>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then + cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi @@ -154,8 +166,8 @@ else fi # Run only if both TRITON_FROM and TRITON_TO_DL are set -if [[ ! -z "$TRITON_FROM" ]] || [[ ! -z "$TRITON_TO_DL" ]]; then - python stress_mail.py +if [[ ! -z "$TRITON_FROM" ]] && [[ ! -z "$TRITON_TO_DL" ]]; then + python stress_mail.py "$EMAIL_SUBJECT" fi exit $RET diff --git a/qa/L0_memory/test.sh b/qa/L0_memory/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_memory_growth/busy_op_test.py b/qa/L0_memory_growth/busy_op_test.py old mode 100644 new mode 100755 index 537c328047..2814f38d8c --- a/qa/L0_memory_growth/busy_op_test.py +++ b/qa/L0_memory_growth/busy_op_test.py @@ -27,56 +27,63 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse -import numpy as np from builtins import range + +import numpy as np import tritongrpcclient as grpcclient import tritonhttpclient as httpclient from tritonclientutils import np_to_triton_dtype FLAGS = None -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-v', - '--verbose', - action="store_true", - required=False, - default=False, - help='Enable verbose output') - parser.add_argument('-u', - '--url', - type=str, - required=False, - default='localhost:8000', - help='Inference server URL. Default is localhost:8000.') parser.add_argument( - '-i', - '--protocol', + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-u", + "--url", type=str, required=False, - default='http', - help='Protocol ("http"/"grpc") used to ' + - 'communicate with inference service. 
Default is "http".') - parser.add_argument('-m', - '--model', - type=str, - required=True, - help='Name of model.') - parser.add_argument('-n', - '--num-requests', - type=int, - required=True, - help='Number of asynchronous requests to launch.') - parser.add_argument('-d', - '--delay', - type=int, - required=True, - help='Number of delay cycles to use as input to model.') + default="localhost:8000", + help="Inference server URL. Default is localhost:8000.", + ) + parser.add_argument( + "-i", + "--protocol", + type=str, + required=False, + default="http", + help='Protocol ("http"/"grpc") used to ' + + 'communicate with inference service. Default is "http".', + ) + parser.add_argument("-m", "--model", type=str, required=True, help="Name of model.") + parser.add_argument( + "-n", + "--num-requests", + type=int, + required=True, + help="Number of asynchronous requests to launch.", + ) + parser.add_argument( + "-d", + "--delay", + type=int, + required=True, + help="Number of delay cycles to use as input to model.", + ) FLAGS = parser.parse_args() if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"): - print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format( - FLAGS.protocol)) + print( + 'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol) + ) exit(1) client_util = httpclient if FLAGS.protocol == "http" else grpcclient @@ -94,8 +101,9 @@ input_data = np.array([FLAGS.delay], dtype=np.int32) inputs = [ - client_util.InferInput("in", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + client_util.InferInput( + "in", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) diff --git a/qa/L0_memory_growth/server_memory_mail.py b/qa/L0_memory_growth/server_memory_mail.py old mode 100644 new mode 100755 index 9ad0279df5..d1307d97a6 --- a/qa/L0_memory_growth/server_memory_mail.py +++ b/qa/L0_memory_growth/server_memory_mail.py @@ -26,21 +26,26 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys -sys.path.append("../common") -import nightly_email_helper +sys.path.append("../common") import glob from datetime import date -if __name__ == '__main__': +import nightly_email_helper + +if __name__ == "__main__": today = date.today().strftime("%Y-%m-%d") subject = "Triton Server Memory Growth " + sys.argv[1] + " Summary: " + today memory_graphs_resnet = glob.glob("memory_growth_resnet*.log") memory_graphs_busyop = glob.glob("memory_growth_busyop.log") write_up = "

This test uses perf_analyzer as clients running on 4 different models. The max allowed difference between mean and maximum memory usage is set to 150MB.

" write_up += "

• What to look for
A linear memory growth in the beginning of the graph is acceptable only when it is followed by a flat memory usage. If a linear memory growth is observed during the entire test then there is possibly a memory leak.

" - html_content = "
" + write_up + "
"
+    html_content = (
+        '        
         
 
'
+        + write_up
+        + '
'
+    )
     for mem_graph in sorted(memory_graphs_resnet):
         html_content += "\n" + mem_graph + "\n"
         with open(mem_graph, "r") as f:
@@ -51,12 +56,18 @@
     # When we see PTX failures in CI, the busyop memory graph is not created.
     if len(memory_graphs_busyop):
         write_up = "

• What to look for
The memory usage should increase continually over time, and a linear growth should be observed in the graph below.

" - html_content += "
" + write_up + "
"
+        html_content += (
+            '
'
+            + write_up
+            + '
'
+        )
         for mem_graph in sorted(memory_graphs_busyop):
             html_content += "\n" + mem_graph + "\n"
             with open(mem_graph, "r") as f:
                 html_content += f.read() + "\n"
     else:
-        html_content += "

The busyop model caused PTX failures when running the CI.

" + html_content += ( + "

The busyop model caused PTX failures when running the CI.

" + ) html_content += "
" nightly_email_helper.send(subject, html_content, is_html=True) diff --git a/qa/L0_memory_growth/test.sh b/qa/L0_memory_growth/test.sh index 4721542ebd..64277e6b6e 100755 --- a/qa/L0_memory_growth/test.sh +++ b/qa/L0_memory_growth/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2020-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -76,18 +76,28 @@ CLIENT_BS=8 # Set the number of repetitions in nightly and weekly tests # Set the email subject for nightly and weekly tests if [ "$TRITON_PERF_WEEKLY" == 1 ]; then - # Run the test for each model approximately 1.5 hours - # All tests are run cumulatively for 7 hours - REPETITION=200 - EMAIL_SUBJECT="Weekly" + if [ "$TRITON_PERF_LONG" == 1 ]; then + # ~ 2.5 days for system under test + REPETITION=1400 + EMAIL_SUBJECT="Weekly Long" + else + # Run the test for each model approximately 1.5 hours + # All tests are run cumulatively for 7 hours + REPETITION=200 + EMAIL_SUBJECT="Weekly" + fi else REPETITION=3 EMAIL_SUBJECT="Nightly" fi # Threshold memory growth in MB -MAX_ALLOWED_ALLOC="150" -export MAX_ALLOWED_ALLOC +# NOTES: +# - Bounded memory growth tests typically show < 70 MB usage +# - Plan/ONNX is typically between 20-40 MB +# - Savedmodel is closer to 50-70 MB +# - Unbounded memory growth test typically shows > 100 MB usage +export MAX_ALLOWED_ALLOC="100" # Create local model repository mkdir -p models/ @@ -114,6 +124,12 @@ set -e RET=0 for MODEL in $(ls models); do + # Skip the resnet50_fp32_libtorch model as it is running into `misaligned address' + # Tracked here: https://nvbugs/3954104 + if [ "$MODEL" == "resnet50_fp32_libtorch" ]; then + continue + fi + # Create temporary model repository and copy only the model being tested rm -rf test_repo && mkdir test_repo cp -r models/$MODEL test_repo/ @@ -146,13 +162,25 @@ for MODEL in $(ls models); do set +e + TEMP_CLIENT_LOG=temp_client.log + TEMP_RET=0 + SECONDS=0 # Run the perf analyzer 'REPETITION' times for ((i=1; i<=$REPETITION; i++)); do - $PERF_ANALYZER -v -m $MODEL -i grpc --concurrency-range $CONCURRENCY -b $CLIENT_BS >> $CLIENT_LOG 2>&1 - if [ $? -ne 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** perf_analyzer for $MODEL failed on iteration $i\n***" + # [TMA-621] Use --no-stability mode in perf analyzer when available + $PERF_ANALYZER -v -m $MODEL -i grpc --concurrency-range $CONCURRENCY -b $CLIENT_BS > $TEMP_CLIENT_LOG 2>&1 + PA_RET=$? + # Success + if [ ${PA_RET} -eq 0 ]; then + continue + # Unstable measurement: OK for this test + elif [ ${PA_RET} -eq 2 ]; then + continue + # Other failures unexpected, report error + else + cat $TEMP_CLIENT_LOG >> $CLIENT_LOG + echo -e "\n***\n*** perf_analyzer for $MODEL failed on iteration $i\n***" >> $CLIENT_LOG RET=1 fi done @@ -177,9 +205,11 @@ for MODEL in $(ls models); do python $MASSIF_TEST $MASSIF_LOG $MAX_ALLOWED_ALLOC --start-from-middle >> $CLIENT_LOG 2>&1 if [ $? 
-ne 0 ]; then cat $CLIENT_LOG - echo -e "\n***\n*** Test for $MODEL Failed\n***" + echo -e "\n***\n*** Test for $MODEL Failed.\n***" RET=1 fi + # Always output memory usage for easier triage of MAX_ALLOWED_ALLOC settings in the future + grep -i "Change in memory allocation" "${CLIENT_LOG}" || true set -e done @@ -194,7 +224,7 @@ rm -rf test_repo && mkdir test_repo cp -r ${DATADIR}/qa_custom_ops/tf_custom_ops/graphdef_busyop test_repo/ # Explicitly set library path so custom ops can find TF -LD_LIBRARY_PATH=/opt/tritonserver/backends/tensorflow1 +LD_LIBRARY_PATH=/opt/tritonserver/backends/tensorflow:$LD_LIBRARY_PATH SERVER_ARGS="--model-repository=`pwd`/test_repo" SERVER_LD_PRELOAD="${DATADIR}/qa_custom_ops/tf_custom_ops/libbusyop.so" @@ -225,8 +255,9 @@ set +e if [ $SKIP_BUSYOP -ne 1 ]; then SECONDS=0 python $BUSY_OP_TEST -v -m graphdef_busyop -d $DELAY_CYCLES -n $NUM_REQUESTS > $CLIENT_LOG 2>&1 + TEST_RETCODE=$? TEST_DURATION=$SECONDS - if [ $? -ne 0 ]; then + if [ ${TEST_RETCODE} -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test graphdef_busyop Failed\n***" RET=1 @@ -248,11 +279,17 @@ if [ $SKIP_BUSYOP -ne 1 ]; then cat ${GRAPH_LOG} # Check the massif output python $MASSIF_TEST $MASSIF_LOG $MAX_ALLOWED_ALLOC --start-from-middle >> $CLIENT_LOG 2>&1 + # This busyop test is expected to return a non-zero error since it is + # intentionally testing unbounded growth. If it returns success for some + # reason, raise error. if [ $? -ne 1 ]; then cat $CLIENT_LOG - echo -e "\n***\n*** Test for graphdef_busyop Failed\n***" + echo -e "\n***\n*** Massif test for graphdef_busyop Failed\n***" + echo -e "\n***\n*** Expected unbounded growth, but found acceptable growth within ${MAX_ALLOWED_ALLOC} MB\n***" RET=1 fi + # Always output memory usage for easier triage of MAX_ALLOWED_ALLOC settings in the future + grep -i "Change in memory allocation" "${CLIENT_LOG}" || true fi set -e @@ -263,8 +300,8 @@ else fi # Run only if both TRITON_FROM and TRITON_TO_DL are set -if [[ ! -z "$TRITON_FROM" ]] || [[ ! -z "$TRITON_TO_DL" ]]; then - python server_memory_mail.py $EMAIL_SUBJECT +if [[ ! -z "$TRITON_FROM" ]] && [[ ! -z "$TRITON_TO_DL" ]]; then + python server_memory_mail.py "$EMAIL_SUBJECT" fi exit $RET diff --git a/qa/L0_metrics/ensemble_delay/config.pbtxt b/qa/L0_metrics/ensemble_delay/config.pbtxt new file mode 100644 index 0000000000..0eaa2f76f7 --- /dev/null +++ b/qa/L0_metrics/ensemble_delay/config.pbtxt @@ -0,0 +1,67 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +platform: "ensemble" +max_batch_size: 4 + +input [ + { + name: "ENSEMBLE_INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +output [ + { + name: "ENSEMBLE_OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + }, + { + name: "ENSEMBLE_OUTPUT1" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +ensemble_scheduling +{ + step [ + { + model_name: "dynamic_composing" + model_version: -1 + input_map { key: "INPUT0", value: "ENSEMBLE_INPUT0" } + output_map { key: "OUTPUT0", value: "ENSEMBLE_OUTPUT0" } + }, + { + model_name: "default_composing" + model_version: -1 + input_map { key: "INPUT0", value: "ENSEMBLE_INPUT0" } + output_map { key: "OUTPUT0", value: "ENSEMBLE_OUTPUT1" } + } + ] +} diff --git a/qa/L0_metrics/identity_delay/config.pbtxt b/qa/L0_metrics/identity_delay/config.pbtxt new file mode 100644 index 0000000000..1062868c2b --- /dev/null +++ b/qa/L0_metrics/identity_delay/config.pbtxt @@ -0,0 +1,58 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +backend: "identity" +max_batch_size: 4 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] + +parameters [ + { + key: "execute_delay_ms" + value: { string_value: "2000" } + } +] diff --git a/qa/L0_metrics/metrics_config_test.py b/qa/L0_metrics/metrics_config_test.py new file mode 100755 index 0000000000..a1324ac28e --- /dev/null +++ b/qa/L0_metrics/metrics_config_test.py @@ -0,0 +1,134 @@ +#!/usr/bin/python +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +import os +import sys + +sys.path.append("../common") + +import unittest + +import requests +import test_util as tu + +INF_COUNTER_PATTERNS = [ + "nv_inference_request_duration", + "nv_inference_queue_duration", + "nv_inference_compute_input_duration", + "nv_inference_compute_infer_duration", + "nv_inference_compute_output_duration", +] +INF_SUMMARY_PATTERNS = [ + "nv_inference_request_summary", + "nv_inference_queue_summary", + "nv_inference_compute_input_summary", + "nv_inference_compute_infer_summary", + "nv_inference_compute_output_summary", +] +CACHE_COUNTER_PATTERNS = [ + "nv_cache_num_hits_per_model", + "nv_cache_num_misses_per_model", + "nv_cache_hit_duration_per_model", + "nv_cache_miss_duration_per_model", +] +CACHE_SUMMARY_PATTERNS = ["nv_cache_hit_summary", "nv_cache_miss_summary"] + + +class MetricsConfigTest(tu.TestResultCollector): + def _get_metrics(self): + metrics_url = "http://localhost:8002/metrics" + r = requests.get(metrics_url) + r.raise_for_status() + return r.text + + # Counters + def test_inf_counters_exist(self): + metrics = self._get_metrics() + for metric in INF_COUNTER_PATTERNS: + self.assertIn(metric, metrics) + + def test_inf_counters_missing(self): + metrics = self._get_metrics() + for metric in INF_COUNTER_PATTERNS: + self.assertNotIn(metric, metrics) + + def test_cache_counters_exist(self): + metrics = self._get_metrics() + for metric in CACHE_COUNTER_PATTERNS: + self.assertIn(metric, metrics) + + def test_cache_counters_missing(self): + metrics = self._get_metrics() + for metric in CACHE_COUNTER_PATTERNS: + self.assertNotIn(metric, metrics) + + # Summaries + def test_inf_summaries_exist(self): + metrics = self._get_metrics() + for metric in INF_SUMMARY_PATTERNS: + self.assertIn(metric, metrics) + + def test_inf_summaries_missing(self): + metrics = self._get_metrics() + for metric in INF_SUMMARY_PATTERNS: + self.assertNotIn(metric, metrics) + + def test_cache_summaries_exist(self): + metrics = self._get_metrics() + for metric in CACHE_SUMMARY_PATTERNS: + self.assertIn(metric, metrics) + + def test_cache_summaries_missing(self): + metrics = self._get_metrics() + for metric in CACHE_SUMMARY_PATTERNS: + self.assertNotIn(metric, metrics) + + def test_summaries_custom_quantiles(self): + metrics = self._get_metrics() + # This env var should be set by test.sh or caller + quantile_pairs = os.environ.get("SUMMARY_QUANTILES", None) + self.assertIsNotNone(quantile_pairs) + + quantiles = [pair.split(":")[0] for pair in quantile_pairs.split(",")] + print(metrics) + for quantile in quantiles: + print(quantile) + self.assertIn(f'quantile="{quantile}"', metrics) + + # DLIS-4762: Disable request summary when caching enabled for now + def test_inf_summaries_exist_with_cache(self): + metrics = self._get_metrics() + bad_patterns = ["nv_inference_request_summary"] + ok_patterns = list(set(INF_SUMMARY_PATTERNS) - set(bad_patterns)) + for metric in ok_patterns: + self.assertIn(metric, metrics) + for metric in bad_patterns: + self.assertNotIn(metric, metrics) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_metrics/metrics_queue_size_test.py b/qa/L0_metrics/metrics_queue_size_test.py new file mode 100755 index 0000000000..0554274109 --- /dev/null +++ b/qa/L0_metrics/metrics_queue_size_test.py @@ -0,0 +1,306 @@ +#!/usr/bin/python +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import math +import time +import unittest +from functools import partial + +import numpy as np +import requests +import test_util as tu +import tritonclient.http +from tritonclient.utils import triton_to_np_dtype + +QUEUE_METRIC_TEMPLATE = ( + 'nv_inference_pending_request_count{{model="{model_name}",version="1"}}' +) +INFER_METRIC_TEMPLATE = 'nv_inference_count{{model="{model_name}",version="1"}}' +EXEC_METRIC_TEMPLATE = 'nv_inference_exec_count{{model="{model_name}",version="1"}}' + + +class MetricsPendingRequestCountTest(tu.TestResultCollector): + def setUp(self): + self.metrics = None + self.metrics_url = "http://localhost:8002/metrics" + self.server_url = "localhost:8000" + + # Used to verify model config is set to expected values + self.max_batch_size = 4 + self.delay_ms = 2000 + self.delay_sec = self.delay_ms // 1000 + + # Setup dummy inputs + dtype = "FP32" + shape = (1, 1) + input_np = np.ones(shape, dtype=triton_to_np_dtype(dtype)) + self.inputs = [ + tritonclient.http.InferInput("INPUT0", shape, dtype).set_data_from_numpy( + input_np + ) + ] + self.ensemble_inputs = [ + tritonclient.http.InferInput( + "ENSEMBLE_INPUT0", shape, dtype + ).set_data_from_numpy(input_np) + ] + + # Verify values for filling request queues + self.num_requests = 10 + self.concurrency = 10 + # Concurrency must be at least as high as number of async requests we intend + # to send N requests to fill request queues before blocking on any results. 
+ self.assertGreaterEqual(self.concurrency, self.num_requests) + self.client = tritonclient.http.InferenceServerClient( + url=self.server_url, concurrency=self.concurrency + ) + + # Test specific configurations + self.max_queue_size = 0 + + def _validate_model_config(self, model_name, max_queue_size=0): + config = self.client.get_model_config(model_name) + print(config) + params = config.get("parameters", {}) + delay_ms = int(params.get("execute_delay_ms", {}).get("string_value")) + max_batch_size = config.get("max_batch_size") + self.assertEqual(delay_ms, self.delay_ms) + self.assertEqual(max_batch_size, self.max_batch_size) + + dynamic_batching = config.get("dynamic_batching", {}) + default_queue_policy = dynamic_batching.get("default_queue_policy", {}) + self.max_queue_size = default_queue_policy.get("max_queue_size", 0) + + self.assertEqual(self.max_queue_size, max_queue_size) + + return config + + def _get_metrics(self): + r = requests.get(self.metrics_url) + r.raise_for_status() + return r.text + + def _get_metric_line(self, metric, metrics): + for line in metrics.splitlines(): + if metric in line: + return line + return None + + def _get_metric_value(self, metric): + metrics = self._get_metrics() + self.assertIn(metric, metrics) + line = self._get_metric_line(metric, metrics) + print(line) + if not line: + return None + value = line.split()[1] + return float(value) + + def _assert_metric_equals(self, metric, expected_value): + value = self._get_metric_value(metric) + self.assertEqual(value, expected_value) + + def _assert_metric_greater_than(self, metric, gt_value): + value = self._get_metric_value(metric) + self.assertGreater(value, gt_value) + + def _send_async_requests(self, model_name, inputs, futures): + for _ in range(self.num_requests): + futures.append(self.client.async_infer(model_name, inputs)) + + def _send_async_requests_sequence(self, num_seq_slots, model_name, inputs, futures): + started_seqs = {} + num_sent = 0 + while num_sent < self.num_requests: + # Add requests to each sequence slot round-robin, seq_id must be > 0 + # We don't care about finishing any sequences, just need to queue up + # requests for each sequence until num_requests is hit. + seq_id = (num_sent % num_seq_slots) + 1 + # Toggle start flag to False after first request per sequence ID + start = True if seq_id not in started_seqs else False + started_seqs[seq_id] = True + futures.append( + self.client.async_infer( + model_name, + inputs, + request_id=str(num_sent), + sequence_id=seq_id, + sequence_start=start, + ) + ) + num_sent += 1 + + def _test_helper( + self, model_name, batch_size, send_requests_func, max_queue_size=0 + ): + self._validate_model_config(model_name, max_queue_size=max_queue_size) + + queue_size = QUEUE_METRIC_TEMPLATE.format(model_name=model_name) + infer_count = INFER_METRIC_TEMPLATE.format(model_name=model_name) + exec_count = EXEC_METRIC_TEMPLATE.format(model_name=model_name) + # Metric should be zero before sending any requests + self._assert_metric_equals(queue_size, 0) + # Send N requests, letting scheduler delay queue fill up when applicable + futures = [] + send_requests_func(model_name, self.inputs, futures) + # Give Triton a second to load all requests into queues + time.sleep(1) + + # Start from (num_requests-batch_size) because 1 batch should be executing, + # and the rest of the requests should be queued. + # If max_queue_size is specified then the queued requests would be capped + # at max_queue_size. 
+ if max_queue_size != 0: + self._assert_metric_equals(queue_size, max_queue_size) + starting_queue_size = max_queue_size + else: + starting_queue_size = self.num_requests - batch_size + + for expected_queue_size in range(starting_queue_size, 0, -1 * batch_size): + self._assert_metric_equals(queue_size, expected_queue_size) + time.sleep(self.delay_sec) + # Queue should be empty now + self._assert_metric_equals(queue_size, 0) + # Let final batch finish + time.sleep(self.delay_sec) + + # All requests should've been executed without any batching + expected_infer_count = starting_queue_size + batch_size + self._assert_metric_equals(infer_count, expected_infer_count) + expected_exec_count = math.ceil(expected_infer_count / batch_size) + self._assert_metric_equals(exec_count, expected_exec_count) + + failed_count = 0 + for future in futures: + try: + future.get_result() + except Exception as e: + failed_count = failed_count + 1 + + self.assertEqual( + failed_count, self.num_requests - batch_size - starting_queue_size + ) + + def test_default_scheduler(self): + model_name = "default" + # Default scheduler won't do any batching + batch_size = 1 + self._test_helper(model_name, batch_size, self._send_async_requests) + + def test_dynamic_batch_scheduler(self): + model_name = "dynamic" + # With sufficient queue delay set, we expect full batches to be executed + batch_size = self.max_batch_size + self._test_helper(model_name, batch_size, self._send_async_requests) + + def test_fail_max_queue_size(self): + model_name = "max_queue_size" + # This test checks whether metrics are properly accounts for requests + # that fail to enqueue on the server. The test sets the max_queue_size + # and any additional requests beyond the specified queue size should fail + # instead of waiting for execution. + batch_size = self.max_batch_size + self._test_helper( + model_name, batch_size, self._send_async_requests, max_queue_size=4 + ) + + def test_sequence_batch_scheduler_direct(self): + model_name = "sequence_direct" + # With sufficient queue delay and minimum_slot_utilization set, we + # expect full batches to be executed. 
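[Editor's aside, not part of the patch] The queue arithmetic that _test_helper's comments describe works out as follows for the values this test configures; a minimal sketch assuming num_requests=10 and max_batch_size=4:

# Sketch only: expected queue drain for _test_helper with num_requests=10,
# max_batch_size=4 and the 2000 ms execute delay from identity_delay/config.pbtxt.
import math

num_requests, batch_size = 10, 4
starting_queue_size = num_requests - batch_size                      # 6 queued, one batch of 4 executing
drain_steps = list(range(starting_queue_size, 0, -batch_size))       # queue observed at 6, then 2, then 0
expected_infer_count = starting_queue_size + batch_size              # 10 requests inferred in total
expected_exec_count = math.ceil(expected_infer_count / batch_size)   # 3 model executions

assert drain_steps == [6, 2]
assert expected_exec_count == 3
# With max_queue_size=4 (the "max_queue_size" model), the queue is capped at 4,
# so 10 - 4 - 4 = 2 requests are rejected, which is what test_fail_max_queue_size
# counts via failed futures.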
+ batch_size = self.max_batch_size + num_seq_slots = batch_size + send_requests_func = partial(self._send_async_requests_sequence, num_seq_slots) + self._test_helper(model_name, batch_size, send_requests_func) + + def test_sequence_batch_scheduler_oldest(self): + model_name = "sequence_oldest" + # With sufficient queue delay set, we expect full batches to be executed + batch_size = self.max_batch_size + num_seq_slots = batch_size + send_requests_func = partial(self._send_async_requests_sequence, num_seq_slots) + self._test_helper(model_name, batch_size, send_requests_func) + + def test_ensemble_scheduler(self): + ensemble_model_name = "ensemble" + composing_model_names = ["dynamic_composing", "default_composing"] + ensemble_queue_size = QUEUE_METRIC_TEMPLATE.format( + model_name=ensemble_model_name + ) + composing_queue_sizes = [ + QUEUE_METRIC_TEMPLATE.format(model_name=name) + for name in composing_model_names + ] + ensemble_infer_count = INFER_METRIC_TEMPLATE.format( + model_name=ensemble_model_name + ) + composing_infer_counts = [ + INFER_METRIC_TEMPLATE.format(model_name=name) + for name in composing_model_names + ] + + # Metric should be zero before sending any requests + self._assert_metric_equals(ensemble_queue_size, 0) + for queue_size in composing_queue_sizes: + self._assert_metric_equals(queue_size, 0) + # Send some ensemble requests + futures = [] + self._send_async_requests(ensemble_model_name, self.ensemble_inputs, futures) + # Give Triton time to pass some requests to composing models. This test + # is less comprehensive on checking exact queue values, and just verifies + # each composing queue gets filled and ensemble's queue is empty. + time.sleep(1) + + # Top-level ensemble size should still be zero, as all pending requests should + # be scheduled and reflected in composing models, and not considered "pending" at ensemble level. + self._assert_metric_equals(ensemble_queue_size, 0) + # Composing models should be non-zero + for queue_size in composing_queue_sizes: + self._assert_metric_greater_than(queue_size, 0) + + # Verify no inference exceptions were raised and let composing models + # finish their requests + for future in futures: + future.get_result() + + # Check that all queues are empty after getting results + self._assert_metric_equals(ensemble_queue_size, 0) + for queue_size in composing_queue_sizes: + self._assert_metric_equals(queue_size, 0) + + # Sanity check infer counts on ensemble and composing models + self._assert_metric_equals(ensemble_infer_count, self.num_requests) + for infer_count in composing_infer_counts: + self._assert_metric_equals(infer_count, self.num_requests) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_metrics/test.sh b/qa/L0_metrics/test.sh index 46059ef96a..dea1c62041 100755 --- a/qa/L0_metrics/test.sh +++ b/qa/L0_metrics/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -42,14 +42,49 @@ MODELDIR=`pwd`/models DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver -SERVER_ARGS="--model-repository=${MODELDIR}" +BASE_SERVER_ARGS="--model-repository=${MODELDIR}" +SERVER_ARGS="${BASE_SERVER_ARGS}" SERVER_LOG="./inference_server.log" source ../common/util.sh -rm -f $SERVER_LOG +CLIENT_LOG="client.log" +TEST_RESULT_FILE="test_results.txt" +function check_unit_test() { + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + else + EXPECTED_NUM_TESTS="${1:-1}" + check_test_results ${TEST_RESULT_FILE} ${EXPECTED_NUM_TESTS} + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi +} + +function run_and_check_server() { + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi +} +rm -f $SERVER_LOG RET=0 +if [ `ps | grep -c "tritonserver"` != "0" ]; then + echo -e "Tritonserver already running" + echo -e `ps | grep tritonserver` + exit 1 +fi + +### UNIT TESTS + TEST_LOG="./metrics_api_test.log" UNIT_TEST=./metrics_api_test @@ -65,6 +100,8 @@ if [ $? -ne 0 ]; then fi set -e +### GPU Metrics + # Prepare a libtorch float32 model with basic config rm -rf $MODELDIR model=libtorch_float32_float32_float32 @@ -77,12 +114,7 @@ mkdir -p $MODELDIR/${model}/1 && \ set +e export CUDA_VISIBLE_DEVICES=0,1,2 -run_server -if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" - cat $SERVER_LOG - exit 1 -fi +run_and_check_server num_gpus=`curl -s localhost:8002/metrics | grep "nv_gpu_utilization{" | wc -l` if [ $num_gpus -ne 3 ]; then @@ -95,12 +127,7 @@ kill $SERVER_PID wait $SERVER_PID export CUDA_VISIBLE_DEVICES=0 -run_server -if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" - cat $SERVER_LOG - exit 1 -fi +run_and_check_server num_gpus=`curl -s localhost:8002/metrics | grep "nv_gpu_utilization{" | wc -l` if [ $num_gpus -ne 1 ]; then @@ -118,13 +145,8 @@ METRICS_INTERVAL_MS=500 # the update is not ready for unexpected reason WAIT_INTERVAL_SECS=0.6 -SERVER_ARGS="$SERVER_ARGS --metrics-interval-ms=${METRICS_INTERVAL_MS}" -run_server -if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" - cat $SERVER_LOG - exit 1 -fi +SERVER_ARGS="$BASE_SERVER_ARGS --metrics-interval-ms=${METRICS_INTERVAL_MS}" +run_and_check_server num_iterations=10 @@ -155,8 +177,182 @@ for (( i = 0; i < $num_iterations; ++i )); do prev_energy=$current_energy done +### CPU / RAM Metrics + +# The underlying values for these metrics do not always update frequently, +# so give ample WAIT time to make sure they change and are being updated. 
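[Editor's aside, not part of the patch] A rough Python rendition of the CPU-metric polling loop that the shell implements just below, for readability only: sample each metric on an interval and tolerate a small number of iterations where the value did not change between samples.

import time

import requests

METRICS_URL = "http://localhost:8002/metrics"   # default metrics endpoint used throughout this test
CPU_METRICS = ["nv_cpu_utilization", "nv_cpu_memory_used_bytes"]

def sample(metric):
    # Prometheus text format: "<name>[{labels}] <value>"; HELP/TYPE lines start with '#'
    for line in requests.get(METRICS_URL).text.splitlines():
        if line.startswith(metric):
            return float(line.split()[-1])
    return None

for metric in CPU_METRICS:
    prev, num_not_updated = sample(metric), 0
    for _ in range(10):          # num_iterations in the shell loop
        time.sleep(2.0)          # WAIT_INTERVAL_SECS
        current = sample(metric)
        if current == prev:
            num_not_updated += 1
        prev = current
    # Mirrors the shell's tolerance threshold (num_not_updated_threshold=3)
    assert num_not_updated <= 3, f"{metric} rarely updated"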
+CPU_METRICS="nv_cpu_utilization nv_cpu_memory_used_bytes" +WAIT_INTERVAL_SECS=2.0 +for metric in ${CPU_METRICS}; do + echo -e "\n=== Checking Metric: ${metric} ===\n" + prev_value=`curl -s localhost:8002/metrics | grep ${metric} | grep -v "HELP\|TYPE" | awk '{print $2}'` + + num_not_updated=0 + num_not_updated_threshold=3 + for (( i = 0; i < $num_iterations; ++i )); do + sleep $WAIT_INTERVAL_SECS + current_value=`curl -s localhost:8002/metrics | grep ${metric} | grep -v "HELP\|TYPE" | awk '{print $2}'` + if [ $current_value == $prev_value ]; then + num_not_updated=$((num_not_updated+1)) + fi + prev_value=$current_value + done + + # Give CPU metrics some tolerance to not update, up to a threshold + # DLIS-4304: An alternative may be to run some busy work on CPU in the + # background rather than allowing a tolerance threshold + if [[ ${num_not_updated} -gt ${num_not_updated_threshold} ]]; then + cat $SERVER_LOG + echo "Metrics were not updated ${num_not_updated}/${num_iterations} times for interval of ${METRICS_INTERVAL_MS} milliseconds for metric: ${metric}" + echo -e "\n***\n*** Metric Interval test failed. \n***" + RET=1 + break + fi +done + +# Verify reported total memory is non-zero +total_memory=`curl -s localhost:8002/metrics | grep "nv_cpu_memory_total_bytes" | grep -v "HELP\|TYPE" | awk '{print $2}'` +test -z "${total_memory}" && total_memory=0 +if [ ${total_memory} -eq 0 ]; then + echo "Found nv_cpu_memory_total_bytes had a value of zero, this should not happen." + echo -e "\n***\n*** CPU total memory test failed. \n***" + RET=1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +### Metric Config CLI and different Metric Types ### +MODELDIR="${PWD}/unit_test_models" +mkdir -p "${MODELDIR}/identity_cache_on/1" +mkdir -p "${MODELDIR}/identity_cache_off/1" +BASE_SERVER_ARGS="--model-repository=${MODELDIR} --model-control-mode=explicit" +PYTHON_TEST="metrics_config_test.py" + +# Check default settings: Counters should be enabled, summaries should be disabled +SERVER_ARGS="${BASE_SERVER_ARGS} --load-model=identity_cache_off" +run_and_check_server +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_counters_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_summaries_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_counters_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_summaries_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +kill $SERVER_PID +wait $SERVER_PID + +# Enable summaries, counters still enabled by default +SERVER_ARGS="${BASE_SERVER_ARGS} --load-model=identity_cache_off --metrics-config summary_latencies=true" +run_and_check_server +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_counters_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_summaries_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_counters_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_summaries_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +kill $SERVER_PID +wait $SERVER_PID + +# Enable summaries, disable counters +SERVER_ARGS="${BASE_SERVER_ARGS} --load-model=identity_cache_off --metrics-config summary_latencies=true --metrics-config counter_latencies=false" +run_and_check_server +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_counters_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 
${PYTHON_TEST} MetricsConfigTest.test_inf_summaries_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_counters_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_summaries_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +kill $SERVER_PID +wait $SERVER_PID + +# Enable summaries and counters, check cache metrics +CACHE_ARGS="--cache-config local,size=1048576" +SERVER_ARGS="${BASE_SERVER_ARGS} ${CACHE_ARGS} --load-model=identity_cache_on --metrics-config summary_latencies=true --metrics-config counter_latencies=true" +run_and_check_server +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_counters_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +# DLIS-4762: Asserts that request summary is not published when cache is +# enabled for a model, until this if fixed. +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_summaries_exist_with_cache 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_counters_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_summaries_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +kill $SERVER_PID +wait $SERVER_PID + +# Check setting custom summary quantiles +export SUMMARY_QUANTILES="0.1:0.0.1,0.7:0.01,0.75:0.01" +SERVER_ARGS="${BASE_SERVER_ARGS} --load-model=identity_cache_off --metrics-config summary_latencies=true --metrics-config summary_quantiles=${SUMMARY_QUANTILES}" +run_and_check_server +python3 ${PYTHON_TEST} MetricsConfigTest.test_summaries_custom_quantiles 2>&1 | tee ${CLIENT_LOG} +check_unit_test +kill $SERVER_PID +wait $SERVER_PID + +### Pending Request Count (Queue Size) Metric Behavioral Tests ### +MODELDIR="${PWD}/queue_size_models" +SERVER_ARGS="--model-repository=${MODELDIR} --log-verbose=1" +PYTHON_TEST="metrics_queue_size_test.py" +rm -rf "${MODELDIR}" +mkdir -p "${MODELDIR}" + +# Re-use an identity model that sleeps during execution for N seconds for the +# batch of requests. Then we can confirm queue size behaviors for various +# scheduling/batching strategies. +BASE_MODEL="identity_delay" +# Don't use special debug env var for this, just set sufficient parameters for +# each scheduler to let them fill batches when possible. 
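[Editor's aside, not part of the patch] The section below builds one model per scheduler type under queue_size_models/; as a quick usage sketch, the pending-request gauge that metrics_queue_size_test.py asserts on can be read for any of those models like this (names such as "dynamic" refer to the models created below):

import requests

def pending_request_count(model, version="1"):
    # Matches the QUEUE_METRIC_TEMPLATE used in metrics_queue_size_test.py
    needle = f'nv_inference_pending_request_count{{model="{model}",version="{version}"}}'
    for line in requests.get("http://localhost:8002/metrics").text.splitlines():
        if line.startswith(needle):
            return float(line.split()[-1])
    return None

# e.g. while requests are queued against the "dynamic" model
print(pending_request_count("dynamic"))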
+unset TRITONSERVER_DELAY_SCHEDULER +export MAX_BATCH_SIZE=4 +# Delay up to 100ms to form batches up to MAX_BATCH_SIZE +export MAX_QUEUE_DELAY_US=100000 + +# Create a model per scheduler type +DEFAULT_MODEL="${MODELDIR}/default" +cp -r "${BASE_MODEL}" "${DEFAULT_MODEL}" +mkdir -p "${DEFAULT_MODEL}/1" +sed -i "s/^max_batch_size.*/max_batch_size: ${MAX_BATCH_SIZE}/" "${DEFAULT_MODEL}/config.pbtxt" + +DYNAMIC_MODEL="${MODELDIR}/dynamic" +cp -r "${DEFAULT_MODEL}" "${DYNAMIC_MODEL}" +echo -e "\ndynamic_batching { max_queue_delay_microseconds: ${MAX_QUEUE_DELAY_US} }\n" >> "${DYNAMIC_MODEL}/config.pbtxt" + +MAX_QUEUE_SIZE_MODEL="${MODELDIR}/max_queue_size" +cp -r "${DEFAULT_MODEL}" "${MAX_QUEUE_SIZE_MODEL}" +echo -e "\ndynamic_batching { max_queue_delay_microseconds: ${MAX_QUEUE_DELAY_US} default_queue_policy { max_queue_size: 4 } }\n" >> "${MAX_QUEUE_SIZE_MODEL}/config.pbtxt" + +SEQUENCE_DIRECT_MODEL="${MODELDIR}/sequence_direct" +cp -r "${DEFAULT_MODEL}" "${SEQUENCE_DIRECT_MODEL}" +echo -e "\nsequence_batching { direct { max_queue_delay_microseconds: ${MAX_QUEUE_DELAY_US}, minimum_slot_utilization: 1.0 } }\n" >> "${SEQUENCE_DIRECT_MODEL}/config.pbtxt" + +SEQUENCE_OLDEST_MODEL="${MODELDIR}/sequence_oldest" +cp -r "${DEFAULT_MODEL}" "${SEQUENCE_OLDEST_MODEL}" +echo -e "\nsequence_batching { oldest { max_queue_delay_microseconds: ${MAX_QUEUE_DELAY_US}, max_candidate_sequences: ${MAX_BATCH_SIZE} } }\n" >> "${SEQUENCE_OLDEST_MODEL}/config.pbtxt" + +BASE_ENSEMBLE="ensemble_delay" +ENSEMBLE_MODEL="${MODELDIR}/ensemble" +cp -r "${BASE_ENSEMBLE}" "${ENSEMBLE_MODEL}" +mkdir -p "${ENSEMBLE_MODEL}/1" +# Use uniquely named composing models to avoid clashing +# metric values with individual and ensemble tests. +cp -r "${DEFAULT_MODEL}" "${MODELDIR}/default_composing" +cp -r "${DYNAMIC_MODEL}" "${MODELDIR}/dynamic_composing" + + +run_and_check_server +python3 ${PYTHON_TEST} 2>&1 | tee ${CLIENT_LOG} kill $SERVER_PID wait $SERVER_PID +expected_tests=6 +check_unit_test "${expected_tests}" if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" diff --git a/qa/L0_metrics/unit_test_models/identity_cache_off/config.pbtxt b/qa/L0_metrics/unit_test_models/identity_cache_off/config.pbtxt new file mode 100644 index 0000000000..863c35df07 --- /dev/null +++ b/qa/L0_metrics/unit_test_models/identity_cache_off/config.pbtxt @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "identity" +max_batch_size: 0 +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +response_cache { + enable: false +} diff --git a/qa/L0_metrics/unit_test_models/identity_cache_on/config.pbtxt b/qa/L0_metrics/unit_test_models/identity_cache_on/config.pbtxt new file mode 100644 index 0000000000..4bf5a7ef3b --- /dev/null +++ b/qa/L0_metrics/unit_test_models/identity_cache_on/config.pbtxt @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "identity" +max_batch_size: 0 +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +response_cache { + enable: true +} diff --git a/qa/L0_mlflow/plugin_test.py b/qa/L0_mlflow/plugin_test.py old mode 100644 new mode 100755 index 8dbf9d9146..a5d87a3c19 --- a/qa/L0_mlflow/plugin_test.py +++ b/qa/L0_mlflow/plugin_test.py @@ -27,52 +27,52 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") -import sys +import json import unittest + +import numpy as np import test_util as tu from mlflow.deployments import get_deploy_client -import json -import numpy as np class PluginTest(tu.TestResultCollector): - def setUp(self): - self.client_ = get_deploy_client('triton') + self.client_ = get_deploy_client("triton") def _validate_deployment(self, model_name): # create - self.client_.create_deployment(model_name, - "models:/{}/1".format(model_name), - flavor="onnx") + self.client_.create_deployment( + model_name, "models:/{}/1".format(model_name), flavor="onnx" + ) # list deployment_list = self.client_.list_deployments() self.assertEqual(len(deployment_list), 1) - self.assertEqual(deployment_list[0]['name'], model_name) + self.assertEqual(deployment_list[0]["name"], model_name) # get deployment = self.client_.get_deployment(model_name) - self.assertEqual(deployment['name'], model_name) + self.assertEqual(deployment["name"], model_name) # predict inputs = {} with open("./mlflow-triton-plugin/examples/input.json", "r") as f: input_json = json.load(f) - for key, value in input_json['inputs'].items(): + for key, value in input_json["inputs"].items(): inputs[key] = np.array(value, dtype=np.float32) output = self.client_.predict(model_name, inputs) - with open("./mlflow-triton-plugin/examples/expected_output.json", - "r") as f: + with open("./mlflow-triton-plugin/examples/expected_output.json", "r") as f: output_json = json.load(f) - for key, value in output_json['outputs'].items(): + for key, value in output_json["outputs"].items(): np.testing.assert_allclose( - output['outputs'][key], + output["outputs"][key], np.array(value, dtype=np.int32), - err_msg='Inference result is not correct') + err_msg="Inference result is not correct", + ) # delete self.client_.delete_deployment(model_name) @@ -81,13 +81,12 @@ def test_onnx_flavor(self): # Log the ONNX model to MLFlow import mlflow.onnx import onnx + model = onnx.load( "./mlflow-triton-plugin/examples/onnx_float32_int32_int32/1/model.onnx" ) # Use a different name to ensure the plugin operates on correct model - mlflow.onnx.log_model(model, - "triton", - registered_model_name="onnx_model") + mlflow.onnx.log_model(model, "triton", registered_model_name="onnx_model") self._validate_deployment("onnx_model") @@ -95,24 +94,28 @@ def test_onnx_flavor_with_files(self): # Log the ONNX model and additional Triton config file to MLFlow import mlflow.onnx import onnx + model = onnx.load( "./mlflow-triton-plugin/examples/onnx_float32_int32_int32/1/model.onnx" ) - config_path = "./mlflow-triton-plugin/examples/onnx_float32_int32_int32/config.pbtxt" + config_path = ( + "./mlflow-triton-plugin/examples/onnx_float32_int32_int32/config.pbtxt" + ) # Use a different name to ensure the plugin operates on correct model - mlflow.onnx.log_model(model, - "triton", - registered_model_name="onnx_model_with_files") + mlflow.onnx.log_model( + model, "triton", registered_model_name="onnx_model_with_files" + ) mlflow.log_artifact(config_path, "triton") self._validate_deployment("onnx_model_with_files") # Check if the additional files are properly copied import filecmp + self.assertTrue( - filecmp.cmp(config_path, - "./models/onnx_model_with_files/config.pbtxt")) + filecmp.cmp(config_path, "./models/onnx_model_with_files/config.pbtxt") + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_mlflow/test.sh b/qa/L0_mlflow/test.sh old mode 100644 new mode 100755 index 74c9348f1d..4b5205ba25 --- 
a/qa/L0_mlflow/test.sh +++ b/qa/L0_mlflow/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -31,10 +31,28 @@ source ../common/util.sh rm -fr *.log *.json +# The default version of python 3.10.6 included in +# Ubuntu 22.04 installs blinker 1.4. This doesn't +# work with the awscli which we try to install. +# Uninstalling blinker and allowing pip to install blinker 1.6 +# fixes this issue. The alternative to this is to +# install a higher version of python which uses blinker 1.6, +# but it is unknown whether this test should rely on +# the default installation of python. +apt remove -y python3-blinker + RET=0 # Set up MLflow and dependencies used by the test -pip install mlflow onnx onnxruntime +pip install mlflow onnx onnxruntime boto3 + +# Install AWS CLI +if ! command -v aws --version &> /dev/null; then + curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip" + unzip awscliv2.zip + ./aws/install + rm -r ./aws/ ./awscliv2.zip +fi # Set environment variables for MLFlow and Triton plugin export MLFLOW_MODEL_REPO=./mlflow/artifacts @@ -49,14 +67,18 @@ pip install ./mlflow-triton-plugin/ python - << EOF from mlflow.tracking import MlflowClient c = MlflowClient() -for m in c.list_registered_models(): +for m in c.search_registered_models(): c.delete_registered_model(m.name) EOF rm -rf ./models mkdir -p ./models +# Put some models in model repository to make sure MLFlow plugin would ignore +# model that is not registered via MLFlow +cp -r ./mlflow-triton-plugin/examples/onnx_float32_int32_int32 ./models/existing_model + SERVER=/opt/tritonserver/bin/tritonserver -SERVER_ARGS="--model-repository=./models --strict-model-config=false --model-control-mode=explicit" +SERVER_ARGS="--model-repository=./models --strict-model-config=false --model-control-mode=explicit --load-model=*" SERVER_LOG="./inference_server.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -94,6 +116,10 @@ if [ $CLI_RET -eq 0 ]; then echo -e "\n***\n*** Expect deployed 'triton' flavor model to be listed\n***" CLI_RET=1 fi + if [ `grep -c "existing_model.*READY" $CLI_LOG` != "0" ]; then + echo -e "\n***\n*** Unexpected non-MLflow model listed\n***" + CLI_RET=1 + fi fi if [ $CLI_RET -eq 0 ]; then mlflow deployments get -t triton --name onnx_float32_int32_int32 >>$CLI_LOG 2>&1 @@ -152,6 +178,7 @@ PY_TEST=plugin_test.py TEST_RESULT_FILE='test_results.txt' python $PY_TEST >>$PY_LOG 2>&1 if [ $? -ne 0 ]; then + cat $SERVER_LOG cat $PY_LOG echo -e "\n***\n*** Python Test Failed\n***" RET=1 @@ -166,6 +193,80 @@ fi set -e kill_server + + +# +# Test S3, the setup is duplicated from L0_storage_S3, except the bucket is +# created empty +# + +# Clear mlflow registered models if any +python - << EOF +from mlflow.tracking import MlflowClient +c = MlflowClient() +for m in c.search_registered_models(): + c.delete_registered_model(m.name) +EOF + +# S3 credentials are necessary for this test. 
Pass via ENV variables +aws configure set default.region $AWS_DEFAULT_REGION && \ + aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID && \ + aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY + +# S3 bucket path (Point to bucket when testing cloud storage) +BUCKET_URL="s3://triton-bucket-${CI_JOB_ID}" + +# Cleanup and delete S3 test bucket if it already exists (due to test failure) +aws s3 rm $BUCKET_URL --recursive --include "*" && \ + aws s3 rb $BUCKET_URL || true + +# Make S3 test bucket +aws s3 mb "${BUCKET_URL}" + +# Remove Slash in BUCKET_URL +BUCKET_URL=${BUCKET_URL%/} +BUCKET_URL_SLASH="${BUCKET_URL}/" + +export TRITON_MODEL_REPO=${BUCKET_URL} +SERVER_ARGS="--model-repository=${TRITON_MODEL_REPO} --model-control-mode=explicit" +SERVER_LOG="./inference_server.s3.log" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + # Clean up bucket contents and delete bucket before exiting test + aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" + aws s3 rb "${BUCKET_URL}" + exit 1 +fi + +# ONNX flavor with Python package +set +e +PY_LOG=plugin_py.s3.log +PY_TEST=plugin_test.py +TEST_RESULT_FILE='test_results.txt' +python $PY_TEST >>$PY_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $SERVER_LOG + cat $PY_LOG + echo -e "\n***\n*** Python Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE 2 + if [ $? -ne 0 ]; then + cat $PY_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill_server + +# Clean up bucket contents and delete bucket +aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" +aws s3 rb "${BUCKET_URL}" + if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else diff --git a/qa/L0_model_config/autofill_noplatform/common/no_version/expected b/qa/L0_model_config/autofill_noplatform/common/no_version/expected index 483dbc34cb..94e9de9123 100644 --- a/qa/L0_model_config/autofill_noplatform/common/no_version/expected +++ b/qa/L0_model_config/autofill_noplatform/common/no_version/expected @@ -1 +1 @@ -unexpected platform type '' for no_version \ No newline at end of file +Invalid model name: Could not determine backend for model 'no_version' with no backend in model configuration. Expected model name of the form 'model.'. diff --git a/qa/L0_model_config/autofill_noplatform/custom/no_delimiter/config.pbtxt b/qa/L0_model_config/autofill_noplatform/custom/no_delimiter/config.pbtxt new file mode 100644 index 0000000000..e69de29bb2 diff --git a/qa/L0_model_config/autofill_noplatform/custom/no_delimiter/expected b/qa/L0_model_config/autofill_noplatform/custom/no_delimiter/expected new file mode 100644 index 0000000000..57b8cbdc02 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/custom/no_delimiter/expected @@ -0,0 +1 @@ +Invalid model name: Could not determine backend for model 'no_delimiter' with no backend in model configuration. Expected model name of the form 'model.'. 
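[Editor's aside, not part of the patch] The S3 portion above drives bucket creation and cleanup through the aws CLI; a rough boto3 equivalent of that lifecycle (an assumption for illustration only, the test itself shells out to aws) looks like:

import os

import boto3

bucket_name = f"triton-bucket-{os.environ.get('CI_JOB_ID', 'local')}"
s3 = boto3.resource(
    "s3",
    region_name=os.environ["AWS_DEFAULT_REGION"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# aws s3 mb "${BUCKET_URL}" (regions other than us-east-1 also need a
# CreateBucketConfiguration with a LocationConstraint)
s3.create_bucket(Bucket=bucket_name)

# ... start tritonserver with --model-repository=s3://<bucket> and run plugin_test.py ...

# aws s3 rm "${BUCKET_URL_SLASH}" --recursive && aws s3 rb "${BUCKET_URL}"
s3.Bucket(bucket_name).objects.all().delete()
s3.Bucket(bucket_name).delete()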
diff --git a/qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/config.pbtxt b/qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/config.pbtxt new file mode 100644 index 0000000000..e69de29bb2 diff --git a/qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/expected b/qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/expected new file mode 100644 index 0000000000..e5f6d77f81 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/expected @@ -0,0 +1,2 @@ +Invalid argument: unable to find 'libtriton_unknown.so' or 'unknown/model.py' for model 'unknown_backend.unknown' + diff --git a/qa/L0_model_config/autofill_noplatform/ensemble/invalid_input_map/invalid_input_map/config.pbtxt b/qa/L0_model_config/autofill_noplatform/ensemble/invalid_input_map/invalid_input_map/config.pbtxt index 2a38f51a85..8bb0896d40 100644 --- a/qa/L0_model_config/autofill_noplatform/ensemble/invalid_input_map/invalid_input_map/config.pbtxt +++ b/qa/L0_model_config/autofill_noplatform/ensemble/invalid_input_map/invalid_input_map/config.pbtxt @@ -71,7 +71,7 @@ ensemble_scheduling { value: "temp_tensor_3" } input_map { - key: "INTPUT3" + key: "INPUT3" value: "temp_tensor_4" } input_map { diff --git a/qa/L0_model_config/autofill_noplatform/ensemble/non_existing_model/expected b/qa/L0_model_config/autofill_noplatform/ensemble/non_existing_model/expected index 4dd27097c5..09561377d9 100644 --- a/qa/L0_model_config/autofill_noplatform/ensemble/non_existing_model/expected +++ b/qa/L0_model_config/autofill_noplatform/ensemble/non_existing_model/expected @@ -1 +1 @@ -ensemble non_existing_model contains models that are not available: fp32_dim1_batch4_input4 \ No newline at end of file +ensemble non_existing_model contains models that are not available or ambiguous: fp32_dim1_batch4_input4 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/config.pbtxt b/qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/config.pbtxt new file mode 100644 index 0000000000..61e5eee972 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/config.pbtxt @@ -0,0 +1,94 @@ +name: "unreachable_output_3" +max_batch_size: 2 +platform: "ensemble" +ensemble_scheduling { + step [ + { + model_name: "fp32_dim1_batch4" + model_version: -1 + input_map { + key: "input" + value: "data" + } + output_map { + key: "output" + value: "temp_tensor_4" + } + }, + { + model_name: "fp32_dim1_batch4" + model_version: -1 + input_map { + key: "input" + value: "not_written_tensor" + } + output_map { + key: "output" + value: "prob_2" + } + }, + { + model_name: "fp32_dim1_batch4_output3" + model_version: -1 + input_map { + key: "input" + value: "data" + } + output_map { + key: "output1" + value: "temp_tensor_1" + } + output_map { + key: "output2" + value: "temp_tensor_2" + } + output_map { + key: "output3" + value: "temp_tensor_3" + } + }, + { + model_name: "fp32_dim1_batch4_input4" + model_version: -1 + input_map { + key: "input1" + value: "temp_tensor_1" + } + input_map { + key: "input2" + value: "temp_tensor_2" + } + input_map { + key: "input3" + value: "temp_tensor_3" + } + input_map { + key: "input4" + value: "temp_tensor_4" + } + output_map { + key: "output" + value: "prob" + } + } + ] +} +input [ + { + name: "data" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] +output [ + { + name: "prob" + data_type: TYPE_FP32 + dims: [ 16 ] + }, + { + name: "prob_2" + 
data_type: TYPE_FP32 + dims: [ 16 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/expected b/qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/expected new file mode 100644 index 0000000000..f7add40dda --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/expected @@ -0,0 +1 @@ +output 'prob_2' for ensemble 'unreachable_output_3' is not written: at least one of its depending tensors, 'not_written_tensor', is not connected \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/config.pbtxt b/qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/config.pbtxt new file mode 100644 index 0000000000..87f49cf11a --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/config.pbtxt @@ -0,0 +1,12 @@ +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 256 +} diff --git a/qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/expected b/qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/expected new file mode 100644 index 0000000000..bd6051f9d5 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/expected @@ -0,0 +1 @@ +model 'bad_input_dims', tensor 'input1': the model expects 2 dimensions (shape \[1,4\]) but the model configuration specifies 2 dimensions (shape \[1,256\]) \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/config.pbtxt b/qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/config.pbtxt new file mode 100644 index 0000000000..b177c07d18 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/config.pbtxt @@ -0,0 +1,12 @@ +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 128 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} diff --git a/qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/expected b/qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/expected new file mode 100644 index 0000000000..2f0e5be8e2 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/expected @@ -0,0 +1 @@ +model 'bad_output_dims', tensor 'Func/PartitionedCall/output/_2:0': the model expects 2 dimensions (shape \[1,4\]) but the model configuration specifies 2 dimensions (shape \[1,128\]) \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/config.pbtxt b/qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/config.pbtxt new file mode 100644 index 0000000000..be95f0b18a --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/config.pbtxt @@ -0,0 +1,6 @@ +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} diff --git a/qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/expected b/qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/expected new file mode 100644 index 0000000000..f6639e85ae --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/expected @@ -0,0 +1 @@ +unable to load model 'too_few_inputs', configuration expects 1 inputs, model provides 2 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/config.pbtxt 
b/qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/config.pbtxt new file mode 100644 index 0000000000..283f498b33 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/config.pbtxt @@ -0,0 +1,18 @@ +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input_extra" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} diff --git a/qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/expected b/qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/expected new file mode 100644 index 0000000000..e88e97dcfb --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/expected @@ -0,0 +1 @@ +unable to load model 'too_many_inputs', configuration expects 3 inputs, model provides 2 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/openvino/unknown_input/config.pbtxt b/qa/L0_model_config/autofill_noplatform/openvino/unknown_input/config.pbtxt new file mode 100644 index 0000000000..ed519869f3 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/unknown_input/config.pbtxt @@ -0,0 +1,24 @@ +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "unknown_input" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} diff --git a/qa/L0_model_config/autofill_noplatform/openvino/unknown_input/expected b/qa/L0_model_config/autofill_noplatform/openvino/unknown_input/expected new file mode 100644 index 0000000000..e540422197 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/unknown_input/expected @@ -0,0 +1 @@ +unexpected inference input 'unknown_input', allowed inputs are: Func/PartitionedCall/input/_0:0, input1 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/openvino/unknown_output/config.pbtxt b/qa/L0_model_config/autofill_noplatform/openvino/unknown_output/config.pbtxt new file mode 100644 index 0000000000..202ec57eca --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/unknown_output/config.pbtxt @@ -0,0 +1,18 @@ +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "unknown_output" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} diff --git a/qa/L0_model_config/autofill_noplatform/openvino/unknown_output/expected b/qa/L0_model_config/autofill_noplatform/openvino/unknown_output/expected new file mode 100644 index 0000000000..b374338374 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/unknown_output/expected @@ -0,0 +1 @@ +unexpected inference output 'unknown_output', allowed outputs are: Func/PartitionedCall/output/_2:0, Func/PartitionedCall/output/_3:0 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/python/conflicting_max_batch_size/model.py b/qa/L0_model_config/autofill_noplatform/python/conflicting_max_batch_size/model.py index ef24740bd6..17da02915b 100644 --- a/qa/L0_model_config/autofill_noplatform/python/conflicting_max_batch_size/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/conflicting_max_batch_size/model.py @@ -1,4 +1,4 @@ -# 
Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/conflicting_scheduler_sequence/model.py b/qa/L0_model_config/autofill_noplatform/python/conflicting_scheduler_sequence/model.py index d668deb544..b1399382c4 100644 --- a/qa/L0_model_config/autofill_noplatform/python/conflicting_scheduler_sequence/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/conflicting_scheduler_sequence/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.set_dynamic_batching() diff --git a/qa/L0_model_config/autofill_noplatform/python/input_missing_datatype/model.py b/qa/L0_model_config/autofill_noplatform/python/input_missing_datatype/model.py index 41a80a334f..cfd6aab9d6 100644 --- a/qa/L0_model_config/autofill_noplatform/python/input_missing_datatype/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/input_missing_datatype/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/input_missing_dims/model.py b/qa/L0_model_config/autofill_noplatform/python/input_missing_dims/model.py index 3e45521117..8c02b4ce40 100644 --- a/qa/L0_model_config/autofill_noplatform/python/input_missing_dims/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/input_missing_dims/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32'} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32"} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/input_missing_name/model.py b/qa/L0_model_config/autofill_noplatform/python/input_missing_name/model.py index 93bd36ef1f..33a76b6b30 100644 --- a/qa/L0_model_config/autofill_noplatform/python/input_missing_name/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/input_missing_name/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/expected b/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/expected index 9b34c74b2b..c91f4599ee 100644 --- a/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/expected +++ b/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/expected @@ -1 +1 @@ -input 'INPUT1' in auto-complete-config function for model 'input_wrong_property' contains property other than 'name', 'data_type' and 'dims'. +input 'INPUT1' in auto-complete-config function for model 'input_wrong_property' contains property other than 'name', 'data_type', 'dims' and 'optional'. diff --git a/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/model.py b/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/model.py index e43008e584..f3e883db06 100644 --- a/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,19 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4], 'is_shape_tensor:' : True} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = { + "name": "INPUT1", + "data_type": "TYPE_FP32", + "dims": [4], + "is_shape_tensor:": True, + } + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/config.pbtxt b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/config.pbtxt new file mode 100644 index 0000000000..3100235010 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/config.pbtxt @@ -0,0 +1,24 @@ +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/expected b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/expected new file mode 100644 index 0000000000..388c6a728d --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/expected @@ -0,0 +1 @@ +model transaction property in auto-complete-config function for model 'model_transaction_policy_invalid_args' contains property other than 'decoupled' diff --git a/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/model.py b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/model.py new file mode 100644 index 0000000000..4de9d7c80a --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/model.py @@ -0,0 +1,47 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} + transaction_policy = {"invalid": "argument"} + + auto_complete_model_config.set_max_batch_size(4) + auto_complete_model_config.set_model_transaction_policy(transaction_policy) + auto_complete_model_config.add_input(input0) + auto_complete_model_config.add_input(input1) + auto_complete_model_config.add_output(output0) + auto_complete_model_config.add_output(output1) + + return auto_complete_model_config + + def execute(self, requests): + pass diff --git a/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/config.pbtxt b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/config.pbtxt new file mode 100644 index 0000000000..f8113f307e --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/config.pbtxt @@ -0,0 +1,28 @@ +model_transaction_policy { + decoupled: false +} + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/expected b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/expected new file mode 100644 index 0000000000..bbdc5d2165 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/expected @@ -0,0 +1 @@ +trying to change decoupled property in auto-complete-config for model 'model_transaction_policy_mismatch', which is already set to 'False' diff --git a/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/model.py b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/model.py new file mode 100644 index 0000000000..424eca60ce --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/model.py @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} + + auto_complete_model_config.set_max_batch_size(4) + auto_complete_model_config.set_model_transaction_policy(dict(decoupled=True)) + auto_complete_model_config.add_input(input0) + auto_complete_model_config.add_input(input1) + auto_complete_model_config.add_output(output0) + auto_complete_model_config.add_output(output1) + + return auto_complete_model_config + + def execute(self, requests): + pass diff --git a/qa/L0_model_config/autofill_noplatform/python/no_return/model.py b/qa/L0_model_config/autofill_noplatform/python/no_return/model.py index f22d144f47..65fae1dcc2 100644 --- a/qa/L0_model_config/autofill_noplatform/python/no_return/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/no_return/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/output_missing_datatype/model.py b/qa/L0_model_config/autofill_noplatform/python/output_missing_datatype/model.py index 431ef1930f..26ef3e5c7e 100644 --- a/qa/L0_model_config/autofill_noplatform/python/output_missing_datatype/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/output_missing_datatype/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/output_missing_dims/model.py b/qa/L0_model_config/autofill_noplatform/python/output_missing_dims/model.py index 6e05fcbb11..6e43928239 100644 --- a/qa/L0_model_config/autofill_noplatform/python/output_missing_dims/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/output_missing_dims/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32'} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32"} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/output_missing_name/model.py b/qa/L0_model_config/autofill_noplatform/python/output_missing_name/model.py index 2d1651431d..cde57b7827 100644 --- a/qa/L0_model_config/autofill_noplatform/python/output_missing_name/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/output_missing_name/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/output_wrong_property/model.py b/qa/L0_model_config/autofill_noplatform/python/output_wrong_property/model.py index ddccf9fb4f..4dd17ea4e3 100644 --- a/qa/L0_model_config/autofill_noplatform/python/output_wrong_property/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/output_wrong_property/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,19 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4], 'is_shape_tensor:' : True} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = { + "name": "OUTPUT1", + "data_type": "TYPE_FP32", + "dims": [4], + "is_shape_tensor:": True, + } auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/expected new file mode 100644 index 0000000000..9db37f7864 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/expected @@ -0,0 +1 @@ +Internal: unable to autofill for 'bad_input_dims', model tensor configurations are contradicting each other in terms of whether batching is supported \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/expected new file mode 100644 index 
0000000000..584634b2eb --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/expected @@ -0,0 +1 @@ +Invalid argument: unable to load model 'bad_input_type', configuration expects datatype TYPE_FP32 for input 'INPUT1', model provides TYPE_INT32 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/expected new file mode 100644 index 0000000000..70a0138e77 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/expected @@ -0,0 +1 @@ +Invalid argument: model 'bad_output_dims', tensor 'OUTPUT1': the model expects 2 dimensions (shape \[-1,16\]) but the model configuration specifies 2 dimensions (an initial batch dimension because max_batch_size > 0 followed by the explicit tensor shape, making complete shape \[-1,1\]) \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/expected new file mode 100644 index 0000000000..bbbe1846d1 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/expected @@ -0,0 +1 @@ +Invalid argument: unable to load model 'bad_output_type', configuration expects datatype TYPE_INT16 for output 'OUTPUT0', model provides TYPE_INT8 \ No newline at end of file diff --git 
a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/config.pbtxt similarity index 93% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/config.pbtxt index 6ba2274876..cee3e28b89 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/config.pbtxt +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/config.pbtxt @@ -11,7 +11,7 @@ input [ dims: [ 16 ] }, { - name: "INPUT_EXTRA" + name: "INPUT1" data_type: TYPE_INT32 dims: [ 16 ] } diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/expected new file mode 100644 index 0000000000..caaebb93a0 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/expected @@ -0,0 +1 @@ +Invalid argument: unable to load model 'too_many_inputs', configuration expects 3 inputs, model provides 2 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/expected new file mode 100644 index 0000000000..3f101c14fa --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/expected @@ -0,0 +1 @@ +Invalid argument: unexpected inference input 'INPUT_UNKNOWN', allowed inputs are: INPUT0, INPUT1 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/1/model.savedmodel/saved_model.pb similarity index 100% rename from 
qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/expected new file mode 100644 index 0000000000..a525ae910b --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/expected @@ -0,0 +1 @@ +Invalid argument: unexpected inference output 'OUTPUT_UNKNOWN', allowed outputs are: OUTPUT0, OUTPUT1 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_max/expected b/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_max/expected index 24bbb8f7d2..33630c195b 100644 --- a/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_max/expected +++ b/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_max/expected @@ -1 +1 @@ -model configuration specified invalid shape for input 'INPUT0' for model bad_dynamic_shapes_max. Error details: model expected the shape of dimension 0 to be between 4 and 32 but received 33 +model configuration specified invalid shape for input 'INPUT0' for model bad_dynamic_shapes_max. Error details: model expected the shape of dimension 1 to be between 4 and 32 but received 33 diff --git a/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_min/expected b/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_min/expected index add01d771b..288d129df0 100644 --- a/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_min/expected +++ b/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_min/expected @@ -1 +1 @@ -model configuration specified invalid shape for input 'INPUT0' for model bad_dynamic_shapes_min. Error details: model expected the shape of dimension 0 to be between 4 and 32 but received 3 +model configuration specified invalid shape for input 'INPUT0' for model bad_dynamic_shapes_min. 
Error details: model expected the shape of dimension 1 to be between 4 and 32 but received 3 diff --git a/qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/config.pbtxt new file mode 100644 index 0000000000..e69de29bb2 diff --git a/qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/expected b/qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/expected new file mode 100644 index 0000000000..be092e0b0d --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/expected @@ -0,0 +1,22 @@ +name: "empty_config.identity" +version_policy { +latest { + num_versions: 1 +} +} +instance_group { +name: "empty_config.identity" +count: 1 +gpus: 0 +kind: KIND_GPU +} +default_model_filename: "model.identity" +optimization { +input_pinned_memory { + enable: true +} +output_pinned_memory { + enable: true +} +} +backend: "identity" diff --git a/qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/config.pbtxt new file mode 100644 index 0000000000..575da253a5 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/config.pbtxt @@ -0,0 +1,15 @@ +max_batch_size: 64 +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 1000 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 1000 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/expected b/qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/expected new file mode 100644 index 0000000000..e5edfe5f9e --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/expected @@ -0,0 +1,33 @@ +name: "no_backend.identity" +version_policy { +latest { + num_versions: 1 +} +} +max_batch_size: 64 +input { +name: "INPUT0" +data_type: TYPE_INT32 +dims: 1000 +} +output { +name: "OUTPUT0" +data_type: TYPE_INT32 +dims: 1000 +} +instance_group { +name: "no_backend.identity" +count: 1 +gpus: 0 +kind: KIND_GPU +} +default_model_filename: "model.identity" +optimization { +input_pinned_memory { + enable: true +} +output_pinned_memory { + enable: true +} +} +backend: "identity" diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/cpu_instance/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/onnx/cpu_instance/config.pbtxt old mode 100755 new mode 100644 diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected index c8af844b2d..fd06613612 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.1 b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.1 index 436e7937a2..65da68ab57 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.1 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + 
preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.2 b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.2 index c2a4e3d863..32365f3fd4 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.2 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.3 b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.3 index 9f00645e90..0307a34cae 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.3 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected index d8e3a1222f..5a03128998 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.1 b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.1 index 74174340b5..ca1e128d12 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.1 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.2 b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.2 index fc75b0e0a2..fece0349ea 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.2 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.3 b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.3 index fb1f739756..107b9cfc3d 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.3 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/config.pbtxt new file mode 100644 index 0000000000..e69de29bb2 diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected new file mode 100644 
index 0000000000..f4d0fb85bb --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected @@ -0,0 +1,45 @@ +name: "dynamic_batch" +version_policy { + latest { + num_versions: 1 + } +} +max_batch_size: 4 +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 4 +} +instance_group { + name: "dynamic_batch" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +dynamic_batching { + preferred_batch_size: 4 +} +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.1 b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.1 new file mode 100644 index 0000000000..4e420de350 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.1 @@ -0,0 +1,45 @@ +name: "dynamic_batch" +version_policy { + latest { + num_versions: 1 + } +} +max_batch_size: 4 +input { + name: "input1" + data_type: TYPE_INT32 + dims: 4 +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 4 +} +instance_group { + name: "dynamic_batch" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +dynamic_batching { + preferred_batch_size: 4 +} +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.2 b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.2 new file mode 100644 index 0000000000..f66217757d --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.2 @@ -0,0 +1,45 @@ +name: "dynamic_batch" +version_policy { + latest { + num_versions: 1 + } +} +max_batch_size: 4 +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 4 +} +instance_group { + name: "dynamic_batch" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +dynamic_batching { + preferred_batch_size: 4 +} +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.3 b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.3 new file mode 100644 index 0000000000..5a08b4c736 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.3 @@ -0,0 +1,45 @@ +name: "dynamic_batch" +version_policy { + latest { + num_versions: 1 + } +} +max_batch_size: 4 +input { + name: "input1" + data_type: TYPE_INT32 + 
dims: 4 +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 4 +} +instance_group { + name: "dynamic_batch" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +dynamic_batching { + preferred_batch_size: 4 +} +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/config.pbtxt new file mode 100644 index 0000000000..e69de29bb2 diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected new file mode 100644 index 0000000000..4ff077e8ab --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected @@ -0,0 +1,45 @@ +name: "empty_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "empty_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.1 b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.1 new file mode 100644 index 0000000000..8c7ca01525 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.1 @@ -0,0 +1,45 @@ +name: "empty_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "empty_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.2 b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.2 new file mode 100644 index 0000000000..bd0cc02f27 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.2 @@ -0,0 +1,45 @@ +name: "empty_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: 
"Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "empty_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.3 b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.3 new file mode 100644 index 0000000000..745125a795 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.3 @@ -0,0 +1,45 @@ +name: "empty_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "empty_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected new file mode 100644 index 0000000000..8506cd53fb --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected @@ -0,0 +1,45 @@ +name: "no_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "no_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.1 b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.1 new file mode 100644 index 0000000000..f2637ede14 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.1 @@ -0,0 +1,45 @@ +name: "no_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "no_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline 
at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.2 b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.2 new file mode 100644 index 0000000000..3c625cada5 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.2 @@ -0,0 +1,45 @@ +name: "no_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "no_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.3 b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.3 new file mode 100644 index 0000000000..4076982ca5 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.3 @@ -0,0 +1,45 @@ +name: "no_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "no_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/config.pbtxt new file mode 100644 index 0000000000..cfdc579dae --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/config.pbtxt @@ -0,0 +1,14 @@ +max_batch_size: 8 +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT8 + dims: [ 16 ] + label_filename: "output0_labels.txt" + }, + { + name: "OUTPUT1" + data_type: TYPE_INT8 + dims: [ 16 ] + } +] \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected b/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/expected similarity index 62% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected rename to qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/expected index e08c2471c5..b95f710bd9 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/expected @@ -1,38 +1,37 @@ -name: "unknown_input" -platform: "tensorflow_savedmodel" +name: "partial_config" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 8 input { name: "INPUT1" - data_type: TYPE_INT32 + data_type: TYPE_INT8 dims: 
16 } input { name: "INPUT0" - data_type: TYPE_INT32 + data_type: TYPE_INT8 dims: 16 } output { - name: "OUTPUT1" + name: "OUTPUT0" data_type: TYPE_INT8 dims: 16 + label_filename: "output0_labels.txt" } output { - name: "OUTPUT0" + name: "OUTPUT1" data_type: TYPE_INT8 dims: 16 } instance_group { - name: "unknown_input" + name: "partial_config" count: 1 - gpus: 0 - kind: KIND_GPU + kind: KIND_CPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.xml" optimization { input_pinned_memory { enable: true @@ -41,4 +40,4 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.1 b/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/expected.1 similarity index 62% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.1 rename to qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/expected.1 index c97f486287..688ac8fbf5 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/expected.1 @@ -1,38 +1,37 @@ -name: "unknown_input" -platform: "tensorflow_savedmodel" +name: "partial_config" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 8 input { name: "INPUT0" - data_type: TYPE_INT32 + data_type: TYPE_INT8 dims: 16 } input { name: "INPUT1" - data_type: TYPE_INT32 + data_type: TYPE_INT8 dims: 16 } output { - name: "OUTPUT1" + name: "OUTPUT0" data_type: TYPE_INT8 dims: 16 + label_filename: "output0_labels.txt" } output { - name: "OUTPUT0" + name: "OUTPUT1" data_type: TYPE_INT8 dims: 16 } instance_group { - name: "unknown_input" + name: "partial_config" count: 1 - gpus: 0 - kind: KIND_GPU + kind: KIND_CPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.xml" optimization { input_pinned_memory { enable: true @@ -41,4 +40,4 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/conflicting_scheduler_ensemble/model.py b/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/conflicting_scheduler_ensemble/model.py index 72f588f7cb..57589bacdf 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/conflicting_scheduler_ensemble/model.py +++ b/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/conflicting_scheduler_ensemble/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,17 +24,12 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.set_dynamic_batching() diff --git a/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_first_step/model.py b/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_first_step/model.py index 72f588f7cb..57589bacdf 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_first_step/model.py +++ b/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_first_step/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,17 +24,12 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.set_dynamic_batching() diff --git a/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_second_step/model.py b/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_second_step/model.py index 72f588f7cb..57589bacdf 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_second_step/model.py +++ b/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_second_step/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,17 +24,12 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.set_dynamic_batching() diff --git a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected index 577ce5cce4..f11fa57bf2 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected +++ b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected @@ -33,6 +33,7 @@ instance_group { } default_model_filename: "model.py" dynamic_batching { + preferred_batch_size: 4 } optimization { input_pinned_memory { diff --git a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.1 b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.1 index 4880649296..1e5a266319 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.1 @@ -33,6 +33,7 @@ instance_group { } default_model_filename: "model.py" dynamic_batching { + preferred_batch_size: 4 } optimization { input_pinned_memory { diff --git a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.2 b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.2 index 30bdfa2c0f..4b96c9b2a6 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.2 @@ -33,6 +33,7 @@ instance_group { } default_model_filename: "model.py" dynamic_batching { + preferred_batch_size: 4 } optimization { input_pinned_memory { diff --git a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.3 b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.3 index 214f8ef16d..f3c6508cab 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.3 @@ -33,6 +33,7 @@ instance_group { } default_model_filename: "model.py" dynamic_batching { + preferred_batch_size: 4 } optimization { input_pinned_memory { diff --git a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/model.py b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/model.py index d668deb544..b1399382c4 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/model.py +++ b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.set_dynamic_batching() diff --git a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching_no_op/model.py b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching_no_op/model.py index d668deb544..b1399382c4 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching_no_op/model.py +++ b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching_no_op/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.set_dynamic_batching() diff --git a/qa/L0_model_config/autofill_noplatform_success/python/incomplete_input/model.py b/qa/L0_model_config/autofill_noplatform_success/python/incomplete_input/model.py index 48a08b10ad..75000a0ba4 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/incomplete_input/model.py +++ b/qa/L0_model_config/autofill_noplatform_success/python/incomplete_input/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,18 +24,13 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/config.pbtxt new file mode 100644 index 0000000000..3100235010 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/config.pbtxt @@ -0,0 +1,24 @@ +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected similarity index 51% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected index a751f6a56a..4384a240a0 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected @@ -1,38 +1,37 @@ -name: "bad_input_dims" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_dims" + name: "model_transaction_policy" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.1 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.1 similarity index 51% 
rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.1 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.1 index 76e9ff1b96..0ec85aa3f2 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.1 @@ -1,38 +1,37 @@ -name: "bad_input_type" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_type" + name: "model_transaction_policy" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.2 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.2 similarity index 51% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.2 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.2 index 9386bf4541..db2d305cc2 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.2 @@ -1,38 +1,37 @@ -name: "bad_input_type" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_type" + name: "model_transaction_policy" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.3 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.3 similarity index 51% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.3 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.3 index 5361bbe5b2..2d88c5a970 100644 --- 
a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.3 @@ -1,38 +1,37 @@ -name: "bad_input_type" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_type" + name: "model_transaction_policy" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/model.py b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/model.py new file mode 100644 index 0000000000..424eca60ce --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/model.py @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} + + auto_complete_model_config.set_max_batch_size(4) + auto_complete_model_config.set_model_transaction_policy(dict(decoupled=True)) + auto_complete_model_config.add_input(input0) + auto_complete_model_config.add_input(input1) + auto_complete_model_config.add_output(output0) + auto_complete_model_config.add_output(output1) + + return auto_complete_model_config + + def execute(self, requests): + pass diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/config.pbtxt new file mode 100644 index 0000000000..3100235010 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/config.pbtxt @@ -0,0 +1,24 @@ +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected new file mode 100644 index 0000000000..173c66ce07 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected @@ -0,0 +1,45 @@ +name: "model_transaction_policy_decoupled_false" +version_policy { + latest { + num_versions: 1 + } +} +max_batch_size: 4 +input { + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 +} +input { + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 +} +output { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 +} +output { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 +} +instance_group { + name: "model_transaction_policy_decoupled_false" + count: 1 + gpus: 0 + kind: KIND_GPU +} +default_model_filename: "model.py" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "python" +model_transaction_policy { +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.1 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.1 similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.1 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.1 index 6fee8a3160..bc03df083b 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.1 @@ -1,38 +1,37 @@ -name: "bad_output_dims" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_decoupled_false" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT0" - data_type: TYPE_INT32 - 
dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_output_dims" + name: "model_transaction_policy_decoupled_false" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,6 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { +} \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.2 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.2 similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.2 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.2 index 01d91d8868..89ddbebf8b 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.2 @@ -1,38 +1,37 @@ -name: "bad_input_dims" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_decoupled_false" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_dims" + name: "model_transaction_policy_decoupled_false" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,6 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { +} \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.3 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.3 similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.3 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.3 index 7fa56796b1..75aefdca7f 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.3 @@ -1,38 +1,37 @@ -name: "bad_output_type" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_decoupled_false" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT0" - data_type: 
TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_output_type" + name: "model_transaction_policy_decoupled_false" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,6 @@ optimization { enable: true } } -backend: "tensorflow" \ No newline at end of file +backend: "python" +model_transaction_policy { +} \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/model.py b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/model.py new file mode 100644 index 0000000000..848af2a2b2 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/model.py @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} + + auto_complete_model_config.set_max_batch_size(4) + auto_complete_model_config.set_model_transaction_policy(dict(decoupled=False)) + auto_complete_model_config.add_input(input0) + auto_complete_model_config.add_input(input1) + auto_complete_model_config.add_output(output0) + auto_complete_model_config.add_output(output1) + + return auto_complete_model_config + + def execute(self, requests): + pass diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/config.pbtxt new file mode 100644 index 0000000000..1bbf76caaf --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/config.pbtxt @@ -0,0 +1,28 @@ +model_transaction_policy { + decoupled: true +} + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected index 9cda30fccb..4c171e5acc 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected @@ -1,38 +1,37 @@ -name: "bad_input_type" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_no_op" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_type" + name: "model_transaction_policy_no_op" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.1 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.1 similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.1 rename to 
qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.1 index 896a8c2c1e..cf3a56f3a9 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.1 @@ -1,38 +1,37 @@ -name: "bad_input_dims" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_no_op" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_dims" + name: "model_transaction_policy_no_op" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.2 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.2 similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.2 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.2 index a53a195e36..2a7e018955 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.2 @@ -1,38 +1,37 @@ -name: "bad_output_dims" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_no_op" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_output_dims" + name: "model_transaction_policy_no_op" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.3 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.3 similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.3 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.3 index 
215306f8cd..4fbaae787b 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.3 @@ -1,38 +1,37 @@ -name: "bad_output_dims" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_no_op" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_output_dims" + name: "model_transaction_policy_no_op" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/model.py b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/model.py new file mode 100644 index 0000000000..424eca60ce --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/model.py @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} + + auto_complete_model_config.set_max_batch_size(4) + auto_complete_model_config.set_model_transaction_policy(dict(decoupled=True)) + auto_complete_model_config.add_input(input0) + auto_complete_model_config.add_input(input1) + auto_complete_model_config.add_output(output0) + auto_complete_model_config.add_output(output1) + + return auto_complete_model_config + + def execute(self, requests): + pass diff --git a/qa/L0_model_config/autofill_noplatform_success/python/optional_input/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/python/optional_input/config.pbtxt new file mode 100644 index 0000000000..2d2868b90e --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/optional_input/config.pbtxt @@ -0,0 +1,7 @@ +input [ + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.2 b/qa/L0_model_config/autofill_noplatform_success/python/optional_input/expected similarity index 52% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.2 rename to qa/L0_model_config/autofill_noplatform_success/python/optional_input/expected index 9dcea4093c..8bbab5a3b0 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/python/optional_input/expected @@ -1,38 +1,37 @@ -name: "unknown_input" -platform: "tensorflow_savedmodel" +name: "optional_input" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 input { name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } input { name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + data_type: TYPE_FP32 + dims: 4 + optional: true } output { name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "unknown_input" + name: "optional_input" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,4 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" diff --git a/qa/L0_model_config/autofill_noplatform_success/python/optional_input/model.py b/qa/L0_model_config/autofill_noplatform_success/python/optional_input/model.py new file mode 100644 index 0000000000..fca8e06818 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/optional_input/model.py @@ -0,0 +1,48 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input0 = { + "name": "INPUT0", + "data_type": "TYPE_FP32", + "dims": [4], + "optional": True, + } + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} + + auto_complete_model_config.set_max_batch_size(0) + auto_complete_model_config.add_input(input0) + auto_complete_model_config.add_output(output0) + auto_complete_model_config.add_output(output1) + + return auto_complete_model_config + + def execute(self, requests): + pass diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.3 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.3 deleted file mode 100644 index e8b91f678e..0000000000 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.3 +++ /dev/null @@ -1,44 +0,0 @@ -name: "bad_input_dims" -platform: "tensorflow_savedmodel" -version_policy { - latest { - num_versions: 1 - } -} -max_batch_size: 1 -input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 -} -input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 -} -output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 -} -output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 -} -instance_group { - name: "bad_input_dims" - count: 1 - gpus: 0 - kind: KIND_GPU -} -default_model_filename: "model.savedmodel" -optimization { - input_pinned_memory { - enable: true - } - output_pinned_memory { - enable: true - } -} -backend: "tensorflow" diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected deleted file mode 100644 index 948d3a5e32..0000000000 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected +++ /dev/null @@ -1,44 +0,0 @@ -name: "bad_output_dims" -platform: "tensorflow_savedmodel" -version_policy { - latest { - num_versions: 1 - } -} -max_batch_size: 1 -input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 -} -input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 -} -output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 
16 -} -output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 -} -instance_group { - name: "bad_output_dims" - count: 1 - gpus: 0 - kind: KIND_GPU -} -default_model_filename: "model.savedmodel" -optimization { - input_pinned_memory { - enable: true - } - output_pinned_memory { - enable: true - } -} -backend: "tensorflow" diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected deleted file mode 100644 index 584768c4dc..0000000000 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected +++ /dev/null @@ -1,44 +0,0 @@ -name: "bad_output_type" -platform: "tensorflow_savedmodel" -version_policy { - latest { - num_versions: 1 - } -} -max_batch_size: 1 -input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 -} -input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 -} -output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 -} -output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 -} -instance_group { - name: "bad_output_type" - count: 1 - gpus: 0 - kind: KIND_GPU -} -default_model_filename: "model.savedmodel" -optimization { - input_pinned_memory { - enable: true - } - output_pinned_memory { - enable: true - } -} -backend: "tensorflow" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.1 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.1 deleted file mode 100644 index eb8a279bac..0000000000 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.1 +++ /dev/null @@ -1,44 +0,0 @@ -name: "bad_output_type" -platform: "tensorflow_savedmodel" -version_policy { - latest { - num_versions: 1 - } -} -max_batch_size: 1 -input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 -} -input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 -} -output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 -} -output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 -} -instance_group { - name: "bad_output_type" - count: 1 - gpus: 0 - kind: KIND_GPU -} -default_model_filename: "model.savedmodel" -optimization { - input_pinned_memory { - enable: true - } - output_pinned_memory { - enable: true - } -} -backend: "tensorflow" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.2 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.2 deleted file mode 100644 index d36280de72..0000000000 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.2 +++ /dev/null @@ -1,44 +0,0 @@ -name: "bad_output_type" -platform: "tensorflow_savedmodel" -version_policy { - latest { - num_versions: 1 - } -} -max_batch_size: 1 -input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 -} -input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 -} -output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 -} -output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 -} -instance_group { - name: "bad_output_type" - count: 1 - gpus: 0 - kind: KIND_GPU -} -default_model_filename: "model.savedmodel" -optimization { - input_pinned_memory { - enable: true - } - output_pinned_memory { - enable: true - } -} -backend: "tensorflow" \ No newline at end of file diff --git 
a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected index 8f795e196c..9773774b21 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.savedmodel" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.1 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.1 index a57171a3eb..adae59e945 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.1 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.savedmodel" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.2 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.2 index cececc0cdc..ea92806ad7 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.2 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.savedmodel" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.3 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.3 index b2987d0d14..983c1ed344 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.3 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.savedmodel" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/config.pbtxt rename to qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected 
b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected similarity index 91% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected rename to qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected index b84d053a84..f7ea4005b2 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected @@ -1,4 +1,4 @@ -name: "hint_for_no_batch" +name: "hint_for_no_batch_1" platform: "tensorflow_savedmodel" version_policy { latest { @@ -30,7 +30,7 @@ output { dims: 16 } instance_group { - name: "hint_for_no_batch" + name: "hint_for_no_batch_1" count: 1 gpus: 0 kind: KIND_GPU diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.1 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.1 similarity index 91% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.1 rename to qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.1 index 5865093359..30455a0b7f 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.1 @@ -1,4 +1,4 @@ -name: "hint_for_no_batch" +name: "hint_for_no_batch_1" platform: "tensorflow_savedmodel" version_policy { latest { @@ -30,7 +30,7 @@ output { dims: 16 } instance_group { - name: "hint_for_no_batch" + name: "hint_for_no_batch_1" count: 1 gpus: 0 kind: KIND_GPU diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.2 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.2 similarity index 91% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.2 rename to qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.2 index e5bfc5fed9..bf05e9f287 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.2 @@ -1,4 +1,4 @@ -name: "hint_for_no_batch" +name: "hint_for_no_batch_1" platform: "tensorflow_savedmodel" version_policy { latest { @@ -30,7 +30,7 @@ output { dims: 16 } instance_group { - name: "hint_for_no_batch" + name: "hint_for_no_batch_1" count: 1 gpus: 0 kind: KIND_GPU diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.3 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.3 similarity index 91% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.3 rename to qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.3 index a98f07631f..4bd3165b18 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.3 @@ -1,4 +1,4 @@ -name: "hint_for_no_batch" 
+name: "hint_for_no_batch_1" platform: "tensorflow_savedmodel" version_policy { latest { @@ -30,7 +30,7 @@ output { dims: 16 } instance_group { - name: "hint_for_no_batch" + name: "hint_for_no_batch_1" count: 1 gpus: 0 kind: KIND_GPU diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/1/model.savedmodel/saved_model.pb new file mode 100644 index 0000000000000000000000000000000000000000..a76abafbf76990f47d12e0697e119113c0940adb GIT binary patch literal 1407 zcmb_b&2G~`5Uw5DaVKq(PC-IjPCgK1iPX-~w~}z^C9SY?;SweGmPS~1WUndVL3kUU zg)7g1kYLw;)FGf0;mhpo`kU{YKV8Ca0AFPMEQ15Biy%M^qz{JV3A^EzaQl&4;|%zv z!ZvH_^r1UC>YhrnqN%Mz9osMkWxPlk9tyDHCca1babqZxlzGMx!JmMP(VsA(5XSz!3DyfJSV^HVB}uqIJfEm=0)h#tO&a3}qA;L+3hN`1Cdo1DcRt z{hJyH#l|rdhniGPZx?Hdg&~R~KaoTM+-&*40-WR(Fw~SL@2Ls)&>jt~7l}U_!E%sA z@1poF8sH}l)^O~-nz~o7=aC8+GLK!a+l+RT@vqo_(*tk#2u&s^=7==2ZMwGE7QK5&t|GQOdx@e&&0tp3wf8Sy zB?5d<#}_W|Nj}9yBw<21v_fR}-jMR~6mR(m%a*n`TSa15Bs`n{PvSzioU*H#mycP! zNTkSdZ^2cMG}sPm<5tkRpZk{s4-8p9GrvpF6RWd|-p&Jhv&ce*UnQ_WE4Sns^crj9 zSp63HeHChijavx&4+tDVyQ> $CLIENT_LOG + + run_server + if [ "$SERVER_PID" != "0" ]; then + echo -e "*** FAILED: unexpected success starting $SERVER" >> $CLIENT_LOG + RET=1 + kill $SERVER_PID + wait $SERVER_PID + else + EXFOUND=0 + for EXPECTED in `ls $EXPECTEDS`; do + EX=`cat $EXPECTED` + echo "grepping for: $EX" + if grep "$EX" $SERVER_LOG; then + echo -e "Found \"$EX\"" >> $CLIENT_LOG + EXFOUND=1 + break + else + echo -e "Not found \"$EX\"" >> $CLIENT_LOG + fi + done + if [ "$EXFOUND" == "0" ]; then + echo -e "*** FAILED: cli_messages/$TARGET" >> $CLIENT_LOG + RET=1 + fi + fi +done + # Run special test cases for TARGET in `ls special_cases`; do - SERVER_ARGS="--model-repository=`pwd`/models --strict-model-config=true" + case $TARGET in + "invalid_platform") + EXTRA_ARGS="--disable-auto-complete-config" ;; + *) + EXTRA_ARGS="" ;; + esac + + SERVER_ARGS="--model-repository=`pwd`/models $EXTRA_ARGS" SERVER_LOG=$SERVER_LOG_BASE.special_case_${TARGET}.log rm -fr models && mkdir models @@ -288,6 +391,34 @@ for TARGET in `ls special_cases`; do fi done +# Run noautofill unittest +SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --log-verbose=1" +SERVER_LOG=$SERVER_LOG_BASE.special_case_noautofill_test.log + +rm -fr models && mkdir models +cp -r special_cases/noautofill_noconfig models/. + +echo -e "Test on special_cases/noautofill_test" >> $CLIENT_LOG + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python noautofill_test.py >> $CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Python NoAutoFill Test Failed\n***" + RET=1 +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + for TRIAL in $TRIALS; do # Run all tests that require no autofill but that add the platform to # the model config before running the test @@ -339,6 +470,57 @@ for TRIAL in $TRIALS; do done done +for TRIAL in $TRIALS; do + # Run all tests that require no autofill but that add the platform to + # the model config before running the test + for TARGET in `ls noautofill_platform`; do + SERVER_ARGS="--model-repository=`pwd`/models --disable-auto-complete-config" + SERVER_LOG=$SERVER_LOG_BASE.noautofill_platform_disableflag_${TRIAL}_${TARGET}.log + + rm -fr models && mkdir models + cp -r noautofill_platform/$TARGET models/. 
+ + CONFIG=models/$TARGET/config.pbtxt + EXPECTEDS=models/$TARGET/expected* + + # If there is a config.pbtxt change/add platform to it + if [ -f $CONFIG ]; then + sed -i '/platform:/d' $CONFIG + echo "platform: \"$TRIAL\"" >> $CONFIG + cat $CONFIG + fi + + echo -e "Test platform $TRIAL on noautofill_platform/$TARGET with disable-auto-complete-config flag" >> $CLIENT_LOG + + # We expect all the tests to fail with one of the expected + # error messages + run_server + if [ "$SERVER_PID" != "0" ]; then + echo -e "*** FAILED: unexpected success starting $SERVER" >> $CLIENT_LOG + RET=1 + kill $SERVER_PID + wait $SERVER_PID + else + EXFOUND=0 + for EXPECTED in `ls $EXPECTEDS`; do + EX=`cat $EXPECTED` + if grep ^E[0-9][0-9][0-9][0-9].*"$EX" $SERVER_LOG; then + echo -e "Found \"$EX\"" >> $CLIENT_LOG + EXFOUND=1 + break + else + echo -e "Not found \"$EX\"" >> $CLIENT_LOG + fi + done + + if [ "$EXFOUND" == "0" ]; then + echo -e "*** FAILED: platform $TRIAL noautofill_platform/$TARGET with disable-auto-complete-config flag" >> $CLIENT_LOG + RET=1 + fi + fi + done +done + # Run all autofill tests that don't add a platform to the model config # before running the test for TARGET_DIR in `ls -d autofill_noplatform/*/*`; do diff --git a/qa/L0_model_namespacing/python_addsub/__init__.py b/qa/L0_model_namespacing/python_addsub/__init__.py new file mode 100755 index 0000000000..a664eafef0 --- /dev/null +++ b/qa/L0_model_namespacing/python_addsub/__init__.py @@ -0,0 +1,123 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
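The new qa/L0_model_namespacing suite that follows ships a pair of shared Python models, python_addsub and python_subadd, which compute OUTPUT0/OUTPUT1 as element-wise add/sub (and sub/add) of INPUT0/INPUT1 and rely on auto_complete_config() so the composing models need no config.pbtxt of their own. Each composing model in the per-test repositories reuses these definitions through a small model.py shim; a minimal sketch of that shim (mirroring the six-line model.py files added later in this patch, and assuming test.sh has exported TRITON_QA_PYTHON_MODEL_DIR) looks like:

    import os
    import sys

    # test.sh exports TRITON_QA_PYTHON_MODEL_DIR=$TRITON_QA_ROOT_DIR/L0_model_namespacing
    sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"])

    # Re-export the shared TritonPythonModel (use python_subadd for the sub/add variant)
    from python_addsub import *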
+ +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + # Use auto complete feature to ship config.pbtxt along with the Python + # model definition + @staticmethod + def auto_complete_config(auto_complete_model_config): + # Only use packaged config if config is not explicitly provided + config = auto_complete_model_config.as_dict() + if (len(config["input"]) != 0) or (len(config["output"]) != 0): + return auto_complete_model_config + + auto_complete_model_config.add_input( + { + "name": "INPUT0", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + auto_complete_model_config.add_input( + { + "name": "INPUT1", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + auto_complete_model_config.add_output( + { + "name": "OUTPUT0", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + auto_complete_model_config.add_output( + { + "name": "OUTPUT1", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + return auto_complete_model_config + + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") + + self.output0_dtype = pb_utils.triton_string_to_numpy( + output0_config["data_type"] + ) + self.output1_dtype = pb_utils.triton_string_to_numpy( + output1_config["data_type"] + ) + + def execute(self, requests): + """This function is called on inference request.""" + + responses = [] + for request in requests: + in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") + in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") + responses.append(pb_utils.InferenceResponse(self.addsub(in_0, in_1))) + return responses + + def addsub(self, in_0, in_1): + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32), + ) + else: + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) + + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(self.output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(self.output1_dtype)) + return [out_tensor_0, out_tensor_1] diff --git a/qa/L0_model_namespacing/python_subadd/__init__.py b/qa/L0_model_namespacing/python_subadd/__init__.py new file mode 100755 index 0000000000..bd3ddefe9e --- /dev/null +++ b/qa/L0_model_namespacing/python_subadd/__init__.py @@ -0,0 +1,123 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + # Use auto complete feature to ship config.pbtxt along with the Python + # model definition + @staticmethod + def auto_complete_config(auto_complete_model_config): + # Only use packaged config if config is not explicitly provided + config = auto_complete_model_config.as_dict() + if (len(config["input"]) != 0) or (len(config["output"]) != 0): + return auto_complete_model_config + + auto_complete_model_config.add_input( + { + "name": "INPUT0", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + auto_complete_model_config.add_input( + { + "name": "INPUT1", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + auto_complete_model_config.add_output( + { + "name": "OUTPUT0", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + auto_complete_model_config.add_output( + { + "name": "OUTPUT1", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + return auto_complete_model_config + + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") + + self.output0_dtype = pb_utils.triton_string_to_numpy( + output0_config["data_type"] + ) + self.output1_dtype = pb_utils.triton_string_to_numpy( + output1_config["data_type"] + ) + + def execute(self, requests): + """This function is called on inference request.""" + + responses = [] + for request in requests: + in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") + in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") + responses.append(pb_utils.InferenceResponse(self.subadd(in_0, in_1))) + return responses + + def subadd(self, in_0, in_1): + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32), + ) + else: + out_0, out_1 = ( + in_0.as_numpy() - in_1.as_numpy(), + in_0.as_numpy() + in_1.as_numpy(), + ) + + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(self.output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(self.output1_dtype)) + return [out_tensor_0, out_tensor_1] diff --git a/qa/L0_model_namespacing/test.py b/qa/L0_model_namespacing/test.py new file mode 100755 index 0000000000..f45300d4fd --- /dev/null +++ b/qa/L0_model_namespacing/test.py @@ -0,0 +1,361 @@ +#!/usr/bin/env python +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import os +import sys + +sys.path.append(os.path.join(os.environ["TRITON_QA_ROOT_DIR"], "common")) + +import shutil +import time +import unittest + +import numpy as np +import test_util as tu +import tritonclient.http as httpclient +from tritonclient.utils import InferenceServerException + +# +# Test utilities +# + + +# Checker to perform inference on given model, expecting model to have +# [INPUT0, INPUT1] and produce [OUTPUT0, OUTPUT1] where: +# OUTPUT0 = INPUT0 + INPUT1 +# OUTPUT1 = INPUT0 - INPUT1 +class AddSubChecker: + # Optional 'checker_client' may be provided to use a different + # Triton client library, currently it must be either Triton HTTP client + # library or Triton GRPC client library + def __init__(self, checker_client=None): + # client library selection + if checker_client is None: + import tritonclient.http as checker_client + if "http" in checker_client.__name__: + self.client_ = checker_client.InferenceServerClient("localhost:8000") + else: + self.client_ = checker_client.InferenceServerClient("localhost:8001") + + # Create infer input tensors + self.inputs_ = [] + self.inputs_.append(checker_client.InferInput("INPUT0", [16], "INT32")) + self.inputs_.append(checker_client.InferInput("INPUT1", [16], "INT32")) + + # Initialize the data and expected output + input_data = np.arange(start=0, stop=16, dtype=np.int32) + self.inputs_[0].set_data_from_numpy(input_data) + self.inputs_[1].set_data_from_numpy(input_data) + self.expected_outputs_ = { + "add": (input_data + input_data), + "sub": (input_data - input_data), + } + + def infer(self, model): + res = self.client_.infer(model, self.inputs_) + np.testing.assert_allclose( + res.as_numpy("OUTPUT0"), self.expected_outputs_["add"] + ) + np.testing.assert_allclose( + res.as_numpy("OUTPUT1"), self.expected_outputs_["sub"] + ) + + +# Checker to perform inference on given model, expecting model to have +# [INPUT0, INPUT1] and produce [OUTPUT0, OUTPUT1] where: +# OUTPUT0 = INPUT0 - INPUT1 +# OUTPUT1 = INPUT0 + INPUT1 +class SubAddChecker(AddSubChecker): + def infer(self, model): + 
res = self.client_.infer(model, self.inputs_) + np.testing.assert_allclose( + res.as_numpy("OUTPUT0"), self.expected_outputs_["sub"] + ) + np.testing.assert_allclose( + res.as_numpy("OUTPUT1"), self.expected_outputs_["add"] + ) + + +# +# Test suites and cases +# + + +class ModelNamespacePoll(tu.TestResultCollector): + def setUp(self): + self.addsub_ = AddSubChecker() + self.subadd_ = SubAddChecker() + # For other server interaction + self.client_ = httpclient.InferenceServerClient("localhost:8000") + + def check_health(self, expect_live=True, expect_ready=True): + self.assertEqual(self.client_.is_server_live(), expect_live) + self.assertEqual(self.client_.is_server_ready(), expect_ready) + + def test_no_duplication(self): + # Enable model namspacing on repositories that is already valid without + # enabling model namespacing. + # All models should be visible and can be inferred individually + self.check_health() + + # infer check + for model in ["simple_addsub", "composing_addsub"]: + self.addsub_.infer(model) + for model in ["simple_subadd", "composing_subadd"]: + self.subadd_.infer(model) + + def test_duplication(self): + # Enable model namspacing on repositories that each repo has one + # ensemble and it requires an composing model ('composing_model') that + # exists in both repos. + # Expect all models are visible, the ensemble will pick up the correct + # model even the composing model can't be inferred individually. + self.check_health() + + # infer check + for model in [ + "simple_addsub", + ]: + self.addsub_.infer(model) + for model in [ + "simple_subadd", + ]: + self.subadd_.infer(model) + + # error check + try: + self.addsub_.infer("composing_model") + self.assertTrue(False, "expected error for inferring ambiguous named model") + except InferenceServerException as ex: + self.assertIn("ambiguity", ex.message()) + + def test_ensemble_duplication(self): + # Enable model namspacing on repositories that each repo has one + # ensemble with the same name. Expect the ensemble will pick up the correct + # model. + # Expect all models are visible, the ensemble will pick up the correct + # model even the ensemble itself can't be inferred without providing + # namespace. + self.check_health() + + # infer + for model in [ + "composing_addsub", + ]: + self.addsub_.infer(model) + for model in [ + "composing_subadd", + ]: + self.subadd_.infer(model) + + # error check + try: + self.addsub_.infer("simple_ensemble") + self.assertTrue(False, "expected error for inferring ambiguous named model") + except InferenceServerException as ex: + self.assertIn("ambiguity", ex.message()) + + def test_dynamic_resolution(self): + # Same model setup as 'test_duplication', will remove / add one of the + # composing model at runtime and expect the ensemble to be properly + # linked to existing composing model at different steps. + # 1. Remove 'composing_model' in addsub_repo, expect both ensembles use + # 'composing_model' in subadd_repo and act as subadd + # 2. Add back 'composing_model' in addsub_repo, expect the ensembles to behave the + # same as before the removal. + self.assertTrue("NAMESPACE_TESTING_DIRCTORY" in os.environ) + td = os.environ["NAMESPACE_TESTING_DIRCTORY"] + composing_before_path = os.path.join(td, "addsub_repo", "composing_model") + composing_after_path = os.path.join(td, "composing_model") + + self.check_health() + # step 1. 
+ shutil.move(composing_before_path, composing_after_path) + time.sleep(5) + + # infer + for model in ["simple_subadd", "simple_addsub", "composing_model"]: + self.subadd_.infer(model) + + # step 2. + shutil.move(composing_after_path, composing_before_path) + time.sleep(5) + + # infer + for model in [ + "simple_addsub", + ]: + self.addsub_.infer(model) + for model in [ + "simple_subadd", + ]: + self.subadd_.infer(model) + + # error check + try: + self.addsub_.infer("composing_model") + self.assertTrue(False, "expected error for inferring ambiguous named model") + except InferenceServerException as ex: + self.assertIn("ambiguity", ex.message()) + + +class ModelNamespaceExplicit(tu.TestResultCollector): + def setUp(self): + self.addsub_ = AddSubChecker() + self.subadd_ = SubAddChecker() + # For other server interaction + self.client_ = httpclient.InferenceServerClient("localhost:8000") + + def check_health(self, expect_live=True, expect_ready=True): + self.assertEqual(self.client_.is_server_live(), expect_live) + self.assertEqual(self.client_.is_server_ready(), expect_ready) + + def test_no_duplication(self): + # Enable model namspacing on repositories that is already valid without + # enabling model namespacing. + # All models should be visible and can be inferred individually + self.check_health() + # load ensembles, cascadingly load composing model + for model in ["simple_addsub", "simple_subadd"]: + self.client_.load_model(model) + + # infer + for model in ["simple_addsub", "composing_addsub"]: + self.addsub_.infer(model) + for model in ["simple_subadd", "composing_subadd"]: + self.subadd_.infer(model) + + def test_duplication(self): + # Enable model namspacing on repositories that each repo has one + # ensemble and it requires an composing model ('composing_model') that + # exists in both repos. + # Expect all models are visible, the ensemble will pick up the correct + # model even the composing model can't be inferred individually. + self.check_health() + # load ensembles, cascadingly load composing model + for model in ["simple_addsub", "simple_subadd"]: + self.client_.load_model(model) + + # infer + for model in [ + "simple_addsub", + ]: + self.addsub_.infer(model) + for model in [ + "simple_subadd", + ]: + self.subadd_.infer(model) + + # error check + try: + self.addsub_.infer("composing_model") + self.assertTrue(False, "expected error for inferring ambiguous named model") + except InferenceServerException as ex: + self.assertIn("ambiguity", ex.message()) + + def test_ensemble_duplication(self): + # Enable model namspacing on repositories that each repo has one + # ensemble with the same name. Expect the ensemble will pick up the correct + # model. + # Expect all models are visible, the ensemble will pick up the correct + # model even the ensemble itself can't be inferred without providing + # namespace. 
+ self.check_health() + # load ensembles, cascadingly load composing model + for model in ["simple_ensemble"]: + self.client_.load_model(model) + + # infer + for model in [ + "composing_addsub", + ]: + self.addsub_.infer(model) + for model in [ + "composing_subadd", + ]: + self.subadd_.infer(model) + + # error check + try: + self.addsub_.infer("simple_ensemble") + self.assertTrue(False, "expected error for inferring ambiguous named model") + except InferenceServerException as ex: + self.assertIn("ambiguity", ex.message()) + + def test_dynamic_resolution(self): + # Same model setup as 'test_duplication', will remove / add one of the + # composing model at runtime and expect the ensemble to be properly + # linked to existing composing model at different steps. + # 1. Remove 'composing_model' in addsub_repo, expect both ensembles use + # 'composing_model' in subadd_repo and act as subadd. + # 2. Add back 'composing_model' in addsub_repo, expect the ensembles to behave the + # same as before the removal. + self.assertTrue("NAMESPACE_TESTING_DIRCTORY" in os.environ) + td = os.environ["NAMESPACE_TESTING_DIRCTORY"] + composing_before_path = os.path.join(td, "addsub_repo", "composing_model") + composing_after_path = os.path.join(td, "composing_model") + + self.check_health() + # step 1. + shutil.move(composing_before_path, composing_after_path) + # load ensembles, cascadingly load composing model + for model in ["simple_addsub", "simple_subadd"]: + self.client_.load_model(model) + + # infer + for model in ["simple_subadd", "simple_addsub", "composing_model"]: + self.subadd_.infer(model) + + # step 2. + shutil.move(composing_after_path, composing_before_path) + # Explicitly load one of the ensembel, should still trigger cascading + # (re-)load + for model in [ + "simple_addsub", + ]: + self.client_.load_model(model) + + # infer + for model in [ + "simple_addsub", + ]: + self.addsub_.infer(model) + for model in [ + "simple_subadd", + ]: + self.subadd_.infer(model) + + # error check + try: + self.addsub_.infer("composing_model") + self.assertTrue(False, "expected error for inferring ambiguous named model") + except InferenceServerException as ex: + self.assertIn("ambiguity", ex.message()) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_model_namespacing/test.sh b/qa/L0_model_namespacing/test.sh new file mode 100755 index 0000000000..414bd3dde9 --- /dev/null +++ b/qa/L0_model_namespacing/test.sh @@ -0,0 +1,149 @@ +#!/bin/bash +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +TRITON_QA_ROOT_DIR=${TRITON_QA_ROOT_DIR:="/opt/tritonserver/qa"} +source $TRITON_QA_ROOT_DIR/common/util.sh + +RET=0 + +TEST_PY=./test.py +# tests are run individually +EXPECTED_NUM_TESTS="1" +TEST_RESULT_FILE='test_results.txt' + + +export CUDA_VISIBLE_DEVICES=0 +export TRITON_QA_ROOT_DIR=$TRITON_QA_ROOT_DIR +export TRITON_QA_PYTHON_MODEL_DIR=$TRITON_QA_ROOT_DIR/L0_model_namespacing + +rm -fr *.log + +REPO_ARGS="--model-namespacing=true --model-repository=`pwd`/test_dir/addsub_repo --model-repository=`pwd`/test_dir/subadd_repo" +POLL_ARGS="--model-control-mode=POLL --repository-poll-secs=2" +EXPLICIT_ARGS="--model-control-mode=EXPLICIT" + +SERVER=/opt/tritonserver/bin/tritonserver + +# List all tests as each test will use different repo configuration +TEST_LIST=${TEST_LIST:="test_duplication \ + test_dynamic_resolution \ + test_ensemble_duplication \ + test_no_duplication"} + +# Helper to make sure all ensemble have version directory +CURR_DIR=`pwd` +for test_name in $TEST_LIST; do + for model_dir in $CURR_DIR/$test_name/*/*; do + mkdir -p $model_dir/1 + done +done + +# Set this variable to avoid generation of '__pycache__' in the model directory, +# which will cause unintended model reload in POLLING model as Triton sees +# changes in the model directory +export PYTHONDONTWRITEBYTECODE=1 + +# Polling +for test_name in $TEST_LIST; do + TEST_SUITE="ModelNamespacePoll" + TEST_LOG="`pwd`/test.$TEST_SUITE.$test_name.log" + SERVER_LOG="./server.$TEST_SUITE.$test_name.log" + + rm -fr `pwd`/test_dir + cp -r `pwd`/$test_name `pwd`/test_dir + SERVER_ARGS="$REPO_ARGS $POLL_ARGS" + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + set +e + # Pass in the test directory as the test may modify the structure + NAMESPACE_TESTING_DIRCTORY=`pwd`/test_dir python $TEST_PY $TEST_SUITE.$test_name >>$TEST_LOG 2>&1 + if [ $? -ne 0 ]; then + RET=1 + cat $TEST_LOG + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $TEST_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi + set -e + + kill $SERVER_PID + wait $SERVER_PID +done + +# Explicit +for test_name in $TEST_LIST; do + TEST_SUITE="ModelNamespaceExplicit" + TEST_LOG="`pwd`/test.$TEST_SUITE.$test_name.log" + SERVER_LOG="./server.$TEST_SUITE.$test_name.log" + + rm -fr `pwd`/test_dir + cp -r `pwd`/$test_name `pwd`/test_dir + SERVER_ARGS="$REPO_ARGS $EXPLICIT_ARGS" + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + set +e + # Pass in the test directory as the test may modify the structure + NAMESPACE_TESTING_DIRCTORY=`pwd`/test_dir python $TEST_PY $TEST_SUITE.$test_name >>$TEST_LOG 2>&1 + if [ $? -ne 0 ]; then + RET=1 + cat $TEST_LOG + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? 
-ne 0 ]; then + cat $TEST_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi + set -e + + kill $SERVER_PID + wait $SERVER_PID +done + + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_model_namespacing/test_duplication/addsub_repo/composing_model/1/model.py b/qa/L0_model_namespacing/test_duplication/addsub_repo/composing_model/1/model.py new file mode 100644 index 0000000000..13a611e7a3 --- /dev/null +++ b/qa/L0_model_namespacing/test_duplication/addsub_repo/composing_model/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_addsub import * diff --git a/qa/L0_model_namespacing/test_duplication/addsub_repo/simple_addsub/config.pbtxt b/qa/L0_model_namespacing/test_duplication/addsub_repo/simple_addsub/config.pbtxt new file mode 100644 index 0000000000..245e256976 --- /dev/null +++ b/qa/L0_model_namespacing/test_duplication/addsub_repo/simple_addsub/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
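The simple_addsub ensemble config that follows routes the ensemble's INPUT0/INPUT1 into a single composing_model step and maps that step's outputs straight back out. A hedged client-side sketch of how the AddSubChecker in test.py exercises these ensembles over HTTP, and how the composing model duplicated across addsub_repo and subadd_repo surfaces as an ambiguity error when addressed directly (model names and port as used by test.sh):

    import numpy as np
    import tritonclient.http as httpclient
    from tritonclient.utils import InferenceServerException

    client = httpclient.InferenceServerClient("localhost:8000")
    data = np.arange(16, dtype=np.int32)
    inputs = [httpclient.InferInput("INPUT0", [16], "INT32"),
              httpclient.InferInput("INPUT1", [16], "INT32")]
    for inp in inputs:
        inp.set_data_from_numpy(data)

    # Each ensemble resolves "composing_model" within its own namespace ...
    result = client.infer("simple_addsub", inputs)
    assert np.array_equal(result.as_numpy("OUTPUT0"), data + data)
    assert np.array_equal(result.as_numpy("OUTPUT1"), data - data)

    # ... but inferring the duplicated composing model by name is ambiguous.
    try:
        client.infer("composing_model", inputs)
    except InferenceServerException as ex:
        assert "ambiguity" in ex.message()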
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_model" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_duplication/subadd_repo/composing_model/1/model.py b/qa/L0_model_namespacing/test_duplication/subadd_repo/composing_model/1/model.py new file mode 100644 index 0000000000..664c20b58f --- /dev/null +++ b/qa/L0_model_namespacing/test_duplication/subadd_repo/composing_model/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_subadd import * diff --git a/qa/L0_model_namespacing/test_duplication/subadd_repo/simple_subadd/config.pbtxt b/qa/L0_model_namespacing/test_duplication/subadd_repo/simple_subadd/config.pbtxt new file mode 100644 index 0000000000..85d8ec0051 --- /dev/null +++ b/qa/L0_model_namespacing/test_duplication/subadd_repo/simple_subadd/config.pbtxt @@ -0,0 +1,88 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
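In explicit model-control mode (the ModelNamespaceExplicit suite above, run by test.sh with --model-control-mode=EXPLICIT), nothing is served until an ensemble is loaded by name; loading it cascades the load of its composing model from the same namespace. A short sketch of that flow, assuming the same local server as in the previous example:

    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient("localhost:8000")
    client.load_model("simple_addsub")   # also loads addsub_repo's composing_model
    client.load_model("simple_subadd")   # also loads subadd_repo's composing_model
    assert client.is_model_ready("simple_addsub")
    assert client.is_model_ready("simple_subadd")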
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_model" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/composing_model/1/model.py b/qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/composing_model/1/model.py new file mode 100644 index 0000000000..13a611e7a3 --- /dev/null +++ b/qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/composing_model/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_addsub import * diff --git a/qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/simple_addsub/config.pbtxt b/qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/simple_addsub/config.pbtxt new file mode 100644 index 0000000000..245e256976 --- /dev/null +++ b/qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/simple_addsub/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_model" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/composing_model/1/model.py b/qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/composing_model/1/model.py new file mode 100644 index 0000000000..664c20b58f --- /dev/null +++ b/qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/composing_model/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_subadd import * diff --git a/qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/simple_subadd/config.pbtxt b/qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/simple_subadd/config.pbtxt new file mode 100644 index 0000000000..245e256976 --- /dev/null +++ b/qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/simple_subadd/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
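The test_dynamic_resolution repositories above mirror test_duplication; the difference happens at runtime: test.py moves addsub_repo's composing_model out of the repository so only the subadd copy remains, then moves it back. With the polling options used by test.sh (--model-control-mode=POLL --repository-poll-secs=2), the move is picked up on the next poll and both ensembles temporarily resolve to the remaining subadd composing model. A hedged sketch of that step (paths relative to the test_dir copy created by test.sh):

    import shutil
    import time

    # Step 1: remove the addsub copy; both ensembles now act as subadd.
    shutil.move("test_dir/addsub_repo/composing_model", "test_dir/composing_model")
    time.sleep(5)  # longer than repository-poll-secs, so at least one poll has run

    # Step 2: restore it; behavior returns to the pre-removal state.
    shutil.move("test_dir/composing_model", "test_dir/addsub_repo/composing_model")
    time.sleep(5)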
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_model" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/composing_addsub/1/model.py b/qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/composing_addsub/1/model.py new file mode 100644 index 0000000000..13a611e7a3 --- /dev/null +++ b/qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/composing_addsub/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_addsub import * diff --git a/qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/simple_ensemble/config.pbtxt b/qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/simple_ensemble/config.pbtxt new file mode 100644 index 0000000000..2a9f0003a3 --- /dev/null +++ b/qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/simple_ensemble/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
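Because the composing models in these repositories ship only a model.py (no config.pbtxt), their input/output configuration comes entirely from auto_complete_config() in python_addsub/python_subadd. Once a model such as composing_addsub is loaded, the generated configuration can be inspected from the client; a small sketch, assuming the same local server:

    import json

    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient("localhost:8000")
    # The HTTP client returns the model configuration as a dict.
    config = client.get_model_config("composing_addsub")
    print(json.dumps(config, indent=2))  # shows the auto-completed INPUT*/OUTPUT* entries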
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_addsub" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/composing_subadd/1/model.py b/qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/composing_subadd/1/model.py new file mode 100644 index 0000000000..664c20b58f --- /dev/null +++ b/qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/composing_subadd/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_subadd import * diff --git a/qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/simple_ensemble/config.pbtxt b/qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/simple_ensemble/config.pbtxt new file mode 100644 index 0000000000..0ee1015f25 --- /dev/null +++ b/qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/simple_ensemble/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_subadd" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_no_duplication/addsub_repo/composing_addsub/1/model.py b/qa/L0_model_namespacing/test_no_duplication/addsub_repo/composing_addsub/1/model.py new file mode 100644 index 0000000000..13a611e7a3 --- /dev/null +++ b/qa/L0_model_namespacing/test_no_duplication/addsub_repo/composing_addsub/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_addsub import * diff --git a/qa/L0_model_namespacing/test_no_duplication/addsub_repo/simple_addsub/config.pbtxt b/qa/L0_model_namespacing/test_no_duplication/addsub_repo/simple_addsub/config.pbtxt new file mode 100644 index 0000000000..2a9f0003a3 --- /dev/null +++ b/qa/L0_model_namespacing/test_no_duplication/addsub_repo/simple_addsub/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_addsub" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_no_duplication/subadd_repo/composing_subadd/1/model.py b/qa/L0_model_namespacing/test_no_duplication/subadd_repo/composing_subadd/1/model.py new file mode 100644 index 0000000000..664c20b58f --- /dev/null +++ b/qa/L0_model_namespacing/test_no_duplication/subadd_repo/composing_subadd/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_subadd import * diff --git a/qa/L0_model_namespacing/test_no_duplication/subadd_repo/simple_subadd/config.pbtxt b/qa/L0_model_namespacing/test_no_duplication/subadd_repo/simple_subadd/config.pbtxt new file mode 100644 index 0000000000..0ee1015f25 --- /dev/null +++ b/qa/L0_model_namespacing/test_no_duplication/subadd_repo/simple_subadd/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_subadd" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_queue/model_queue_test.py b/qa/L0_model_queue/model_queue_test.py old mode 100644 new mode 100755 index 42bbe9130e..e7be471f79 --- a/qa/L0_model_queue/model_queue_test.py +++ b/qa/L0_model_queue/model_queue_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,17 +27,19 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -from builtins import range -import time import threading +import time import unittest -import numpy as np +from builtins import range +from ctypes import * + import infer_util as iu +import numpy as np import test_util as tu from tritonclientutils import InferenceServerException -from ctypes import * _max_queue_delay_ms = 10000 @@ -44,15 +48,11 @@ class ModelQueueTest(tu.TestResultCollector): - def setUp(self): self.trials_ = [] for base in ["custom", "ensemble"]: for is_http_trial in [True, False]: - self.trials_.append({ - "base": base, - "is_http_trial": is_http_trial - }) + self.trials_.append({"base": base, "is_http_trial": is_http_trial}) global _deferred_exceptions _deferred_exceptions = [] @@ -69,33 +69,41 @@ def check_deferred_exception(self): _deferred_exceptions.pop(0) raise first_exception - def check_response(self, - bs, - dtype, - shapes, - priority, - timeout_us, - thresholds, - base="custom", - is_http_trial=True): - full_shapes = [[ - bs, - ] + shape for shape in shapes] + def check_response( + self, + bs, + dtype, + shapes, + priority, + timeout_us, + thresholds, + base="custom", + is_http_trial=True, + ): + full_shapes = [ + [ + bs, + ] + + shape + for shape in shapes + ] try: start_ms = int(round(time.time() * 1000)) - iu.infer_zero(self, - base, - bs, - dtype, - full_shapes, - full_shapes, - model_version=1, - use_http_json_tensors=False, - use_http=is_http_trial, - use_grpc=(not is_http_trial), - use_streaming=False, - priority=priority, - timeout_us=timeout_us) + iu.infer_zero( + self, + base, + bs, + dtype, + full_shapes, + full_shapes, + model_version=1, + use_http_json_tensors=False, + use_http=is_http_trial, + use_grpc=(not is_http_trial), + use_streaming=False, + priority=priority, + timeout_us=timeout_us, + ) end_ms = int(round(time.time() * 1000)) @@ -104,13 +112,21 @@ def check_response(self, if lt_ms is not None: self.assertTrue( (end_ms - start_ms) < lt_ms, - "expected less than " + str(lt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected less than " + + str(lt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) if 
gt_ms is not None: self.assertTrue( (end_ms - start_ms) > gt_ms, - "expected greater than " + str(gt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected greater than " + + str(gt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) except Exception as ex: self.add_deferred_exception(ex) @@ -124,15 +140,17 @@ def test_max_queue_size(self): for trial in self.trials_: preceding_thread = threading.Thread( target=self.check_response, - args=(8, dtype, shapes, 0, 0, (1999, 1000)), + args=(8, dtype, shapes, 0, 0, (5999, 1000)), ) threads = [] for i in range(10): threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 0, (None, - None)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 0, (None, None)), + kwargs=trial, + ) + ) preceding_thread.start() time.sleep(0.5) for t in threads: @@ -142,15 +160,27 @@ def test_max_queue_size(self): for t in threads: t.join() - # Expect at most two exception with exceeding max queue size error - for i in range(2): + # Expect exactly two exception with exceeding max queue size error + expected_exceeded_count = 2 + exceeded_count = 0 + for i in range(expected_exceeded_count): try: self.check_deferred_exception() except InferenceServerException as ex: self.assertTrue( "Exceeds maximum queue size" in ex.message(), - "Expected error message \"Exceeds maximum queue size\", got: {}" - .format(ex)) + 'Expected error message "Exceeds maximum queue size", got: {}'.format( + ex + ), + ) + exceeded_count = exceeded_count + 1 + self.assertEqual( + exceeded_count, + expected_exceeded_count, + "expected {} requests to fail with exceeded max queue size error, got {}".format( + expected_exceeded_count, exceeded_count + ), + ) try: self.check_deferred_exception() except InferenceServerException as ex: @@ -169,18 +199,26 @@ def test_policy_delay(self): try: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 0, (15000, - 10000)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 0, (15000, 10000)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (100, 0)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (100, 0)), + kwargs=trial, + ) + ) threads[0].start() time.sleep(0.2) threads[1].start() @@ -202,17 +240,26 @@ def test_policy_reject(self): for trial in self.trials_: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 0, (None, None)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 0, (None, None)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (100, 0)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (100, 0)), + kwargs=trial, + ) + ) 
threads[0].start() time.sleep(0.2) threads[1].start() @@ -227,8 +274,10 @@ def test_policy_reject(self): except InferenceServerException as ex: self.assertTrue( "Request timeout expired" in ex.message(), - "Expected error message \"Request timeout expired\", got: {}" - .format(ex)) + 'Expected error message "Request timeout expired", got: {}'.format( + ex + ), + ) try: self.check_deferred_exception() @@ -237,7 +286,7 @@ def test_policy_reject(self): def test_timeout_override(self): # Send requests with batch sizes 1, 1, 3 where the first request - # overrides the timout to be less than 'default_timeout_microseconds', + # overrides the timeout to be less than 'default_timeout_microseconds', # and the second and third requests are sent after the overridden # timeout. Expect the first request is timed-out and rejected before # 'default_timeout_microseconds', which makes the second and third @@ -249,18 +298,26 @@ def test_timeout_override(self): for trial in self.trials_: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 100000, (None, - None)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 100000, (None, None)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (100, 0)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (100, 0)), + kwargs=trial, + ) + ) threads[0].start() time.sleep(0.2) threads[1].start() @@ -275,8 +332,10 @@ def test_timeout_override(self): except InferenceServerException as ex: self.assertTrue( "Request timeout expired" in ex.message(), - "Expected error message \"Request timeout expired\", got: {}" - .format(ex)) + 'Expected error message "Request timeout expired", got: {}'.format( + ex + ), + ) try: self.check_deferred_exception() @@ -288,18 +347,26 @@ def test_timeout_override(self): # 'default_timeout_microseconds' and before queue delay. 
threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 10000000, (None, - None)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 10000000, (None, None)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (1100, 700)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (1100, 700)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (1100, 700)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (1100, 700)), + kwargs=trial, + ) + ) threads[0].start() time.sleep(0.2) threads[1].start() @@ -314,8 +381,10 @@ def test_timeout_override(self): except InferenceServerException as ex: self.assertTrue( "Request timeout expired" in ex.message(), - "Expected error message \"Request timeout expired\", got: {}" - .format(ex)) + 'Expected error message "Request timeout expired", got: {}'.format( + ex + ), + ) try: self.check_deferred_exception() @@ -326,17 +395,26 @@ def test_timeout_override(self): # processed only after 'default_timeout_microseconds' threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 0, (None, None)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 0, (None, None)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (1100, 700)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (1100, 700)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (1100, 700)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (1100, 700)), + kwargs=trial, + ) + ) threads[0].start() time.sleep(0.2) threads[1].start() @@ -351,8 +429,10 @@ def test_timeout_override(self): except InferenceServerException as ex: self.assertTrue( "Request timeout expired" in ex.message(), - "Expected error message \"Request timeout expired\", got: {}" - .format(ex)) + 'Expected error message "Request timeout expired", got: {}'.format( + ex + ), + ) try: self.check_deferred_exception() @@ -369,17 +449,72 @@ def test_priority_levels(self): for trial in self.trials_: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (500, 200)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (500, 200)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 0, (15000, 10000)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 0, (15000, 10000)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 1, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 1, 0, (100, 0)), + kwargs=trial, + ) + ) + threads[0].start() + # wait to make sure the order is correct + time.sleep(0.1) + threads[1].start() + time.sleep(0.2) + threads[2].start() + + for t in threads: + t.join() + + try: + self.check_deferred_exception() + except InferenceServerException as ex: + 
self.assertTrue(False, "unexpected error {}".format(ex)) + + def test_max_priority_levels(self): + # Send 2 requests with batch sizes 2, 1 in default priority (MAX_UINT32+1). Then send + # 1 request with batch size 2 in priority 1. Expect the third request is + # place in the front of the queue and form a preferred batch with the + # first request. + dtype = np.float32 + shapes = ([16],) + MAX_UINT32_PLUS_1 = 4294967296 + for trial in self.trials_: + threads = [] + threads.append( + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (500, 200)), + kwargs=trial, + ) + ) + threads.append( + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, MAX_UINT32_PLUS_1, 0, (15000, 10000)), + kwargs=trial, + ) + ) + threads.append( + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 1, 0, (100, 0)), + kwargs=trial, + ) + ) threads[0].start() # wait to make sure the order is correct time.sleep(0.1) @@ -425,31 +560,47 @@ def test_priority_with_policy(self): # The expected ranges may not be rounded to accommodate # the sleep between sending requests threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 1, 0, (2000, 1000)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 1, 0, (2000, 1000)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 1, 1000000, (3400, - 2400)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 1, 1000000, (3400, 2400)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 1, 0, (1700, 700)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 1, 0, (1700, 700)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 2, 2000000, (None, - None)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 2, 2000000, (None, None)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(3, dtype, shapes, 2, 0, (2700, 1700)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(3, dtype, shapes, 2, 0, (2700, 1700)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(6, dtype, shapes, 2, 0, (15000, 10000)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(6, dtype, shapes, 2, 0, (15000, 10000)), + kwargs=trial, + ) + ) for t in threads: t.start() time.sleep(0.2) @@ -463,8 +614,10 @@ def test_priority_with_policy(self): except InferenceServerException as ex: self.assertTrue( "Request timeout expired" in ex.message(), - "Expected error message \"Request timeout expired\", got: {}" - .format(ex)) + 'Expected error message "Request timeout expired", got: {}'.format( + ex + ), + ) try: self.check_deferred_exception() @@ -472,5 +625,5 @@ def test_priority_with_policy(self): self.assertTrue(False, "unexpected error {}".format(ex)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_model_queue/test.sh b/qa/L0_model_queue/test.sh old mode 100644 new mode 100755 index a995e10687..577b7b7fc2 --- a/qa/L0_model_queue/test.sh +++ b/qa/L0_model_queue/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -57,7 +57,7 @@ RET=0 export CUDA_VISIBLE_DEVICES=0 # Prepare base model. Only test with custom backend as it is sufficient -rm -fr *.log *.serverlog models custom_zero_1_float32 +rm -fr *.log models custom_zero_1_float32 cp -r ../custom_models/custom_zero_1_float32 . && \ mkdir -p ./custom_zero_1_float32/1 && \ mkdir -p ./ensemble_zero_1_float32/1 @@ -82,11 +82,11 @@ rm -fr models && mkdir models && \ echo " }" >> config.pbtxt && \ echo "}" >> config.pbtxt && \ echo "parameters [" >> config.pbtxt && \ - echo "{ key: \"execute_delay_ms\"; value: { string_value: \"1000\" }}" >> config.pbtxt && \ + echo "{ key: \"execute_delay_ms\"; value: { string_value: \"5000\" }}" >> config.pbtxt && \ echo "]" >> config.pbtxt) TEST_CASE=test_max_queue_size -SERVER_LOG="./$TEST_CASE.serverlog" +SERVER_LOG="./$TEST_CASE.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -129,7 +129,7 @@ rm -fr models && mkdir models && \ echo "}" >> config.pbtxt) TEST_CASE=test_policy_delay -SERVER_LOG="./$TEST_CASE.serverlog" +SERVER_LOG="./$TEST_CASE.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -171,7 +171,7 @@ rm -fr models && mkdir models && \ echo "}" >> config.pbtxt) TEST_CASE=test_policy_reject -SERVER_LOG="./$TEST_CASE.serverlog" +SERVER_LOG="./$TEST_CASE.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -214,7 +214,7 @@ rm -fr models && mkdir models && \ echo "}" >> config.pbtxt) TEST_CASE=test_timeout_override -SERVER_LOG="./$TEST_CASE.serverlog" +SERVER_LOG="./$TEST_CASE.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -255,7 +255,51 @@ rm -fr models && mkdir models && \ echo "}" >> config.pbtxt) TEST_CASE=test_priority_levels -SERVER_LOG="./$TEST_CASE.serverlog" +SERVER_LOG="./$TEST_CASE.server.log" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +echo "Test: $TEST_CASE" >>$CLIENT_LOG + +set +e +python $MODEL_QUEUE_TEST ModelQueueTest.$TEST_CASE >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +MAX_UINT64=18446744073709551615 +MAX_UINT32_PLUS_1=4294967296 + +# test_max_priority_levels +rm -fr models && mkdir models && \ + cp -r ensemble_zero_1_float32 models/. && \ + cp -r custom_zero_1_float32 models/. 
&& \ + (cd models/custom_zero_1_float32 && \ + echo "dynamic_batching { " >> config.pbtxt && \ + echo " preferred_batch_size: [ 4, 8 ]" >> config.pbtxt && \ + echo " max_queue_delay_microseconds: 10000000" >> config.pbtxt && \ + echo " priority_levels: $MAX_UINT64" >> config.pbtxt && \ + echo " default_priority_level: $MAX_UINT32_PLUS_1" >> config.pbtxt && \ + echo "}" >> config.pbtxt) + +TEST_CASE=test_max_priority_levels +SERVER_LOG="./$TEST_CASE.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -312,7 +356,7 @@ rm -fr models && mkdir models && \ echo "}" >> config.pbtxt) TEST_CASE=test_priority_with_policy -SERVER_LOG="./$TEST_CASE.serverlog" +SERVER_LOG="./$TEST_CASE.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" diff --git a/qa/L0_model_update/instance_update_test.py b/qa/L0_model_update/instance_update_test.py new file mode 100755 index 0000000000..a3c9ce3201 --- /dev/null +++ b/qa/L0_model_update/instance_update_test.py @@ -0,0 +1,649 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +import concurrent.futures +import json +import os +import random +import time +import unittest + +import numpy as np +import tritonclient.grpc as grpcclient +from models.model_init_del.util import ( + disable_batching, + enable_batching, + get_count, + reset_count, + set_delay, + update_instance_group, + update_model_file, + update_sequence_batching, +) +from tritonclient.utils import InferenceServerException + + +class TestInstanceUpdate(unittest.TestCase): + _model_name = "model_init_del" + + def setUp(self): + # Reset counters + reset_count("initialize") + reset_count("finalize") + # Reset batching + disable_batching() + # Reset delays + set_delay("initialize", 0) + set_delay("infer", 0) + # Reset sequence batching + update_sequence_batching("") + # Initialize client + self._triton = grpcclient.InferenceServerClient("localhost:8001") + + def tearDown(self): + # Check if the test passed for this test case that is tearing down + r = self.defaultTestResult() + self._feedErrorsToResult(r, self._outcome.errors) + # Use `r = self._outcome.result` for the above, if Python >= 3.11 + passed = all(self != test_case for test_case, _ in r.errors + r.failures) + if passed: + # Do nothing if passed + return + # Best effort to reset the model state for the next test case + self._triton.unload_model(self._model_name) + time.sleep(30) # time for instances to finish unloading + + def _get_inputs(self, batching=False): + self.assertIsInstance(batching, bool) + if batching: + shape = [random.randint(1, 2), random.randint(1, 16)] + else: + shape = [random.randint(1, 16)] + inputs = [grpcclient.InferInput("INPUT0", shape, "FP32")] + inputs[0].set_data_from_numpy(np.ones(shape, dtype=np.float32)) + return inputs + + def _infer(self, batching=False): + self._triton.infer(self._model_name, self._get_inputs(batching)) + + def _concurrent_infer(self, concurrency=4, batching=False): + pool = concurrent.futures.ThreadPoolExecutor() + stop = [False] + + def repeat_infer(): + while not stop[0]: + self._infer(batching) + + infer_threads = [pool.submit(repeat_infer) for i in range(concurrency)] + + def stop_infer(): + stop[0] = True + [t.result() for t in infer_threads] + pool.shutdown() + + return stop_infer + + def _check_count(self, kind, expected_count, poll=False): + self.assertIsInstance(poll, bool) + if poll: + timeout = 30 # seconds + poll_interval = 0.1 # seconds + max_retry = timeout / poll_interval + num_retry = 0 + while num_retry < max_retry and get_count(kind) < expected_count: + time.sleep(poll_interval) + num_retry += 1 + self.assertEqual(get_count(kind), expected_count) + + def _load_model(self, instance_count, instance_config="", batching=False): + # Set batching + enable_batching() if batching else disable_batching() + # Load model + self._update_instance_count( + instance_count, 0, instance_config, batching=batching + ) + + def _update_instance_count( + self, + add_count, + del_count, + instance_config="", + wait_for_finalize=False, + batching=False, + ): + self.assertIsInstance(add_count, int) + self.assertGreaterEqual(add_count, 0) + self.assertIsInstance(del_count, int) + self.assertGreaterEqual(del_count, 0) + self.assertIsInstance(instance_config, str) + prev_initialize_count = get_count("initialize") + prev_finalize_count = get_count("finalize") + new_initialize_count = prev_initialize_count + add_count + new_finalize_count = prev_finalize_count + del_count + if len(instance_config) == 0: + prev_count = prev_initialize_count - prev_finalize_count + new_count = prev_count + add_count - 
del_count + instance_config = "{\ncount: " + str(new_count) + "\nkind: KIND_CPU\n}" + update_instance_group(instance_config) + self._triton.load_model(self._model_name) + self._check_count("initialize", new_initialize_count) + self._check_count("finalize", new_finalize_count, wait_for_finalize) + self._infer(batching) + + def _unload_model(self, batching=False): + prev_initialize_count = get_count("initialize") + self._triton.unload_model(self._model_name) + self._check_count("initialize", prev_initialize_count) + self._check_count("finalize", prev_initialize_count, True) + with self.assertRaises(InferenceServerException): + self._infer(batching) + + # Test add -> remove -> add an instance without batching + def test_add_rm_add_instance_no_batching(self): + self._load_model(3, batching=False) + stop = self._concurrent_infer(batching=False) + self._update_instance_count(1, 0, batching=False) # add + self._update_instance_count(0, 1, batching=False) # remove + self._update_instance_count(1, 0, batching=False) # add + stop() + self._unload_model(batching=False) + + # Test add -> remove -> add an instance with batching + def test_add_rm_add_instance_with_batching(self): + self._load_model(4, batching=True) + stop = self._concurrent_infer(batching=True) + self._update_instance_count(1, 0, batching=True) # add + self._update_instance_count(0, 1, batching=True) # remove + self._update_instance_count(1, 0, batching=True) # add + stop() + self._unload_model(batching=True) + + # Test remove -> add -> remove an instance without batching + def test_rm_add_rm_instance_no_batching(self): + self._load_model(2, batching=False) + stop = self._concurrent_infer(batching=False) + self._update_instance_count(0, 1, batching=False) # remove + self._update_instance_count(1, 0, batching=False) # add + self._update_instance_count(0, 1, batching=False) # remove + stop() + self._unload_model(batching=False) + + # Test remove -> add -> remove an instance with batching + def test_rm_add_rm_instance_with_batching(self): + self._load_model(3, batching=True) + stop = self._concurrent_infer(batching=True) + self._update_instance_count(0, 1, batching=True) # remove + self._update_instance_count(1, 0, batching=True) # add + self._update_instance_count(0, 1, batching=True) # remove + stop() + self._unload_model(batching=True) + + # Test reduce instance count to zero + def test_rm_instance_to_zero(self): + self._load_model(1) + # Setting instance group count to 0 will be overwritten to 1, so no + # instances should be created or removed. 
+ self._update_instance_count(0, 0, "{\ncount: 0\nkind: KIND_CPU\n}") + self._unload_model() + + # Test add/remove multiple CPU instances at a time + def test_cpu_instance_update(self): + self._load_model(8) + self._update_instance_count(0, 4) # remove 4 instances + self._update_instance_count(0, 3) # remove 3 instances + self._update_instance_count(0, 0) # no change + time.sleep(0.1) # larger the gap for config.pbtxt timestamp to update + self._update_instance_count(2, 0) # add 2 instances + self._update_instance_count(5, 0) # add 5 instances + self._unload_model() + + # Test add/remove multiple GPU instances at a time + def test_gpu_instance_update(self): + self._load_model(6, "{\ncount: 6\nkind: KIND_GPU\n}") + self._update_instance_count(0, 2, "{\ncount: 4\nkind: KIND_GPU\n}") + self._update_instance_count(3, 0, "{\ncount: 7\nkind: KIND_GPU\n}") + self._unload_model() + + # Test add/remove multiple CPU/GPU instances at a time + def test_gpu_cpu_instance_update(self): + # Load model with 1 GPU instance and 2 CPU instance + self._load_model( + 3, "{\ncount: 2\nkind: KIND_CPU\n},\n{\ncount: 1\nkind: KIND_GPU\n}" + ) + # Add 2 GPU instance and remove 1 CPU instance + self._update_instance_count( + 2, 1, "{\ncount: 1\nkind: KIND_CPU\n},\n{\ncount: 3\nkind: KIND_GPU\n}" + ) + # Shuffle the instances + self._update_instance_count( + 0, 0, "{\ncount: 3\nkind: KIND_GPU\n},\n{\ncount: 1\nkind: KIND_CPU\n}" + ) + time.sleep(0.1) # larger the gap for config.pbtxt timestamp to update + # Remove 1 GPU instance and add 1 CPU instance + self._update_instance_count( + 1, 1, "{\ncount: 2\nkind: KIND_GPU\n},\n{\ncount: 2\nkind: KIND_CPU\n}" + ) + # Unload model + self._unload_model() + + # Test model instance name update + def test_instance_name_update(self): + # Load 3 instances with 2 different names + self._load_model( + 3, + '{\nname: "old_1"\ncount: 1\nkind: KIND_CPU\n},\n{\nname: "old_2"\ncount: 2\nkind: KIND_GPU\n}', + ) + # Change the instance names + self._update_instance_count( + 0, + 0, + '{\nname: "new_1"\ncount: 1\nkind: KIND_CPU\n},\n{\nname: "new_2"\ncount: 2\nkind: KIND_GPU\n}', + ) + # Unload model + self._unload_model() + + # Test instance signature grouping + def test_instance_signature(self): + # Load 2 GPU instances and 3 CPU instances + self._load_model( + 5, + '{\nname: "GPU_group"\ncount: 2\nkind: KIND_GPU\n},\n{\nname: "CPU_group"\ncount: 3\nkind: KIND_CPU\n}', + ) + # Flatten the instances representation + self._update_instance_count( + 0, + 0, + '{\nname: "CPU_1"\ncount: 1\nkind: KIND_CPU\n},\n{\nname: "CPU_2_3"\ncount: 2\nkind: KIND_CPU\n},\n{\nname: "GPU_1"\ncount: 1\nkind: KIND_GPU\n},\n{\nname: "GPU_2"\ncount: 1\nkind: KIND_GPU\n}', + ) + time.sleep(0.1) # larger the gap for config.pbtxt timestamp to update + # Consolidate different representations + self._update_instance_count( + 0, + 0, + '{\nname: "CPU_group"\ncount: 3\nkind: KIND_CPU\n},\n{\nname: "GPU_group"\ncount: 2\nkind: KIND_GPU\n}', + ) + time.sleep(0.1) # larger the gap for config.pbtxt timestamp to update + # Flatten the instances representation + self._update_instance_count( + 0, + 0, + '{\nname: "GPU_1"\ncount: 1\nkind: KIND_GPU\n},\n{\nname: "GPU_2"\ncount: 1\nkind: KIND_GPU\n},\n{\nname: "CPU_1"\ncount: 1\nkind: KIND_CPU\n},\n{\nname: "CPU_2"\ncount: 1\nkind: KIND_CPU\n},\n{\nname: "CPU_3"\ncount: 1\nkind: KIND_CPU\n}', + ) + # Unload model + self._unload_model() + + # Test instance update with invalid instance group config + def test_invalid_config(self): + # Load model with 8 instances + 
self._load_model(8) + # Set invalid config + update_instance_group("--- invalid config ---") + with self.assertRaises(InferenceServerException): + self._triton.load_model("model_init_del") + # Correct config by reducing instances to 4 + self._update_instance_count(0, 4) + # Unload model + self._unload_model() + + # Test instance update with model file changed + def test_model_file_update(self): + self._load_model(5) + update_model_file() + self._update_instance_count( + 6, 5, "{\ncount: 6\nkind: KIND_CPU\n}", wait_for_finalize=True + ) + self._unload_model() + + # Test instance update with non instance config changed in config.pbtxt + def test_non_instance_config_update(self): + self._load_model(4, batching=False) + enable_batching() + self._update_instance_count( + 2, + 4, + "{\ncount: 2\nkind: KIND_CPU\n}", + wait_for_finalize=True, + batching=True, + ) + self._unload_model(batching=True) + + # Test passing new instance config via load API + def test_load_api_with_config(self): + # Load model with 1 instance + self._load_model(1) + # Get the model config from Triton + config = self._triton.get_model_config(self._model_name, as_json=True) + self.assertIn("config", config) + self.assertIsInstance(config["config"], dict) + config = config["config"] + self.assertIn("instance_group", config) + self.assertIsInstance(config["instance_group"], list) + self.assertEqual(len(config["instance_group"]), 1) + self.assertIn("count", config["instance_group"][0]) + self.assertIsInstance(config["instance_group"][0]["count"], int) + # Add an extra instance into the model config + config["instance_group"][0]["count"] += 1 + self.assertEqual(config["instance_group"][0]["count"], 2) + # Load the extra instance via the load API + self._triton.load_model(self._model_name, config=json.dumps(config)) + self._check_count("initialize", 2) # 2 instances in total + self._check_count("finalize", 0) # no instance is removed + self._infer() + # Unload model + self._unload_model() + + # Test instance update with an ongoing inference + def test_update_while_inferencing(self): + # Load model with 1 instance + self._load_model(1) + # Add 1 instance while inferencing + set_delay("infer", 10) + update_instance_group("{\ncount: 2\nkind: KIND_CPU\n}") + with concurrent.futures.ThreadPoolExecutor() as pool: + infer_start_time = time.time() + infer_thread = pool.submit(self._infer) + time.sleep(2) # make sure inference has started + update_start_time = time.time() + update_thread = pool.submit(self._triton.load_model, self._model_name) + update_thread.result() + update_end_time = time.time() + infer_thread.result() + infer_end_time = time.time() + infer_time = infer_end_time - infer_start_time + update_time = update_end_time - update_start_time + # Adding a new instance does not depend on existing instances, so the + # ongoing inference should not block the update. 
+ self.assertGreaterEqual(infer_time, 10.0, "Invalid infer time") + self.assertLess(update_time, 5.0, "Update blocked by infer") + self._check_count("initialize", 2) + self._check_count("finalize", 0) + self._infer() + # Unload model + self._unload_model() + + # Test inference with an ongoing instance update + def test_infer_while_updating(self): + # Load model with 1 instance + self._load_model(1) + # Infer while adding 1 instance + set_delay("initialize", 10) + update_instance_group("{\ncount: 2\nkind: KIND_CPU\n}") + with concurrent.futures.ThreadPoolExecutor() as pool: + update_start_time = time.time() + update_thread = pool.submit(self._triton.load_model, self._model_name) + time.sleep(2) # make sure update has started + infer_start_time = time.time() + infer_thread = pool.submit(self._infer) + infer_thread.result() + infer_end_time = time.time() + update_thread.result() + update_end_time = time.time() + update_time = update_end_time - update_start_time + infer_time = infer_end_time - infer_start_time + # Waiting on new instance creation should not block inference on + # existing instances. + self.assertGreaterEqual(update_time, 10.0, "Invalid update time") + self.assertLess(infer_time, 5.0, "Infer blocked by update") + self._check_count("initialize", 2) + self._check_count("finalize", 0) + self._infer() + # Unload model + self._unload_model() + + # Test instance resource requirement increase + @unittest.skipUnless( + "execution_count" in os.environ["RATE_LIMIT_MODE"], + "Rate limiter precondition not met for this test", + ) + def test_instance_resource_increase(self): + # Load model + self._load_model( + 1, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 2\n}\n]\n}\n}', + ) + # Increase resource requirement + self._update_instance_count( + 1, + 1, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 8\n}\n]\n}\n}', + ) + # Check the model is not blocked from infer due to the default resource + # possibly not updated to the larger resource requirement. 
+ infer_count = 8 + infer_complete = [False for i in range(infer_count)] + + def infer(): + for i in range(infer_count): + self._infer() + infer_complete[i] = True + + with concurrent.futures.ThreadPoolExecutor() as pool: + infer_thread = pool.submit(infer) + time.sleep(infer_count / 2) # each infer should take < 0.5 seconds + self.assertNotIn(False, infer_complete, "Infer possibly stuck") + infer_thread.result() + # Unload model + self._unload_model() + + # Test instance resource requirement increase above explicit resource + @unittest.skipUnless( + os.environ["RATE_LIMIT_MODE"] == "execution_count_with_explicit_resource", + "Rate limiter precondition not met for this test", + ) + def test_instance_resource_increase_above_explicit(self): + # Load model + self._load_model( + 1, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 2\n}\n]\n}\n}', + ) + # Increase resource requirement + with self.assertRaises(InferenceServerException): + self._update_instance_count( + 0, + 0, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 32\n}\n]\n}\n}', + ) + # Correct the resource requirement to match the explicit resource + self._update_instance_count( + 1, + 1, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 10\n}\n]\n}\n}', + ) + # Unload model + self._unload_model() + + # Test instance resource requirement decrease + @unittest.skipUnless( + "execution_count" in os.environ["RATE_LIMIT_MODE"], + "Rate limiter precondition not met for this test", + ) + def test_instance_resource_decrease(self): + # Load model + self._load_model( + 1, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 4\n}\n]\n}\n}', + ) + # Decrease resource requirement + self._update_instance_count( + 1, + 1, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 3\n}\n]\n}\n}', + ) + # Unload model + self._unload_model() + # The resource count of 3 is unique across this entire test, so check + # the server output to make sure it is printed, which ensures the + # max resource is actually decreased. + time.sleep(1) # make sure the log file is updated + log_path = os.path.join( + os.environ["MODEL_LOG_DIR"], + "instance_update_test.rate_limit_" + + os.environ["RATE_LIMIT_MODE"] + + ".server.log", + ) + with open(log_path, mode="r", encoding="utf-8", errors="strict") as f: + if os.environ["RATE_LIMIT_MODE"] == "execution_count": + # Make sure the previous max resource limit of 4 is reduced to 3 + # when no explicit limit is set. + self.assertIn("Resource: R1\t Count: 3", f.read()) + else: + # Make sure the max resource limit is never set to 3 when + # explicit limit of 10 is set. 
+ self.assertNotIn("Resource: R1\t Count: 3", f.read()) + + _direct_sequence_batching_str = ( + "direct { }\nmax_sequence_idle_microseconds: 8000000" + ) + _oldest_sequence_batching_str = ( + "oldest { max_candidate_sequences: 4 }\nmax_sequence_idle_microseconds: 8000000" + ) + + # Test instance update for direct scheduler without any ongoing sequences + def test_direct_scheduler_update_no_ongoing_sequences(self): + self._test_scheduler_update_no_ongoing_sequences( + self._direct_sequence_batching_str + ) + + # Test instance update for direct scheduler with any ongoing sequences + def test_direct_scheduler_update_with_ongoing_sequences(self): + self._test_scheduler_update_with_ongoing_sequences( + self._direct_sequence_batching_str + ) + + # Test instance update for oldest scheduler without ongoing sequences + def test_oldest_scheduler_update_no_ongoing_sequences(self): + self._test_scheduler_update_no_ongoing_sequences( + self._oldest_sequence_batching_str + ) + + # Test instance update for oldest scheduler with ongoing sequences + def test_oldest_scheduler_update_with_ongoing_sequences(self): + self._test_scheduler_update_with_ongoing_sequences( + self._oldest_sequence_batching_str + ) + + # Helper function for testing the success of sequence instance updates + # without any ongoing sequences. + def _test_scheduler_update_no_ongoing_sequences(self, sequence_batching_str): + # Load model + update_instance_group("{\ncount: 2\nkind: KIND_CPU\n}") + update_sequence_batching(sequence_batching_str) + self._triton.load_model(self._model_name) + self._check_count("initialize", 2) + self._check_count("finalize", 0) + # Basic sequence inference + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_start=True + ) + self._triton.infer(self._model_name, self._get_inputs(), sequence_id=1) + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_end=True + ) + # Add 2 instances without in-flight sequence + update_instance_group("{\ncount: 4\nkind: KIND_CPU\n}") + self._triton.load_model(self._model_name) + self._check_count("initialize", 4) + self._check_count("finalize", 0) + # Basic sequence inference + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_start=True + ) + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_end=True + ) + # Remove 1 instance without in-flight sequence + update_instance_group("{\ncount: 3\nkind: KIND_CPU\n}") + self._triton.load_model(self._model_name) + self._check_count("initialize", 4) + self._check_count("finalize", 1, poll=True) + # Basic sequence inference + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_start=True + ) + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_end=True + ) + # Unload model + self._triton.unload_model(self._model_name) + self._check_count("initialize", 4) + self._check_count("finalize", 4, poll=True) + + # Helper function for testing if ongoing sequences may continue to infer on + # the same instance after the instance processing the sequence is removed + # from an instance update, which the removed instance will live until the + # sequences end. 
+ def _test_scheduler_update_with_ongoing_sequences(self, sequence_batching_str): + # Load model + update_instance_group("{\ncount: 3\nkind: KIND_CPU\n}") + update_sequence_batching(sequence_batching_str) + self._triton.load_model(self._model_name) + self._check_count("initialize", 3) + self._check_count("finalize", 0) + # Start sequence 1 and 2 on CPU instances + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_start=True + ) + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=2, sequence_start=True + ) + # Remove all 3 CPU and add 1 GPU instance with in-flight sequences. Both + # in-flight sequences are assigned to any 2 CPU instances, so exactly 1 + # CPU instance can be removed immediately. + update_instance_group("{\ncount: 1\nkind: KIND_GPU\n}") + self._triton.load_model(self._model_name) + self._check_count("initialize", 4) # 3 CPU + 1 GPU + self._check_count("finalize", 1, poll=True) # 1 CPU + # Sequence 1 and 2 may continue to infer + self._triton.infer(self._model_name, self._get_inputs(), sequence_id=1) + self._triton.infer(self._model_name, self._get_inputs(), sequence_id=2) + self._check_count("finalize", 1) # check 2 CPU instances not removed + # Start sequence 3 on GPU instance + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=3, sequence_start=True + ) + self._check_count("finalize", 1) # check 2 CPU instances not removed + # End sequence 1 and 2 will remove the 2 CPU instances + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_end=True + ) + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=2, sequence_end=True + ) + self._check_count("finalize", 3, poll=True) # 3 CPU + # End sequence 3 + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=3, sequence_end=True + ) + # Unload model + self._triton.unload_model(self._model_name) + self._check_count("initialize", 4) # 3 CPU + 1 GPU + self._check_count("finalize", 4, poll=True) # 3 CPU + 1 GPU + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_model_update/test.sh b/qa/L0_model_update/test.sh new file mode 100755 index 0000000000..aa9cf7fcc1 --- /dev/null +++ b/qa/L0_model_update/test.sh @@ -0,0 +1,111 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +# This L0_model_update test should make changes to models without restarting the +# server, unless restarting the server is the only way of accomplishing the +# change. + +export CUDA_VISIBLE_DEVICES=0 +export PYTHONDONTWRITEBYTECODE="True" +export MODEL_LOG_DIR="`pwd`" + +SERVER=/opt/tritonserver/bin/tritonserver +source ../common/util.sh + +function setup_models() { + rm -rf models && mkdir models + # Basic model that log instance creation and destruction + cp -r ../python_models/model_init_del models/model_init_del && \ + mkdir models/model_init_del/1 && \ + mv models/model_init_del/model.py models/model_init_del/1 +} + +RET=0 + +# Test model instance update with rate limiting on/off and explicit resource +for RATE_LIMIT_MODE in "off" "execution_count" "execution_count_with_explicit_resource"; do + + RATE_LIMIT_ARGS="--rate-limit=$RATE_LIMIT_MODE" + if [ "$RATE_LIMIT_MODE" == "execution_count_with_explicit_resource" ]; then + RATE_LIMIT_ARGS="--rate-limit=execution_count --rate-limit-resource=R1:10" + fi + + export RATE_LIMIT_MODE=$RATE_LIMIT_MODE + TEST_LOG="instance_update_test.rate_limit_$RATE_LIMIT_MODE.log" + SERVER_LOG="./instance_update_test.rate_limit_$RATE_LIMIT_MODE.server.log" + + setup_models + SERVER_ARGS="--model-repository=models --model-control-mode=explicit $RATE_LIMIT_ARGS --log-verbose=2" + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + set +e + python instance_update_test.py > $TEST_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed model instance update test on rate limit mode $RATE_LIMIT_MODE\n***" + cat $TEST_LOG + RET=1 + fi + set -e + + kill $SERVER_PID + wait $SERVER_PID + + set +e + grep "Should not print this" $SERVER_LOG + if [ $? -eq 0 ]; then + echo -e "\n***\n*** Found \"Should not print this\" on \"$SERVER_LOG\"\n***" + cat $SERVER_LOG + RET=1 + fi + set -e + +done + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi +exit $RET diff --git a/qa/L0_multi_server/test.sh b/qa/L0_multi_server/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_nan_inf/models/nan_inf_output/1/model.py b/qa/L0_nan_inf/models/nan_inf_output/1/model.py index de610c6d3c..17cfb04fa0 100644 --- a/qa/L0_nan_inf/models/nan_inf_output/1/model.py +++ b/qa/L0_nan_inf/models/nan_inf_output/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,20 +25,20 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import json + import numpy as np import triton_python_backend_utils as pb_utils -class TritonPythonModel: +class TritonPythonModel: def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = json.loads(args["model_config"]) def execute(self, requests): - """ This function is called on inference request. - """ + """This function is called on inference request.""" responses = [] - for request in requests: + for _ in requests: # Include one of each specially parsed JSON value: nan, inf, and -inf out_0 = np.array([np.nan, np.inf, np.NINF, 1, 2, 3], dtype=np.float32) out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0) diff --git a/qa/L0_nan_inf/nan_inf_test.py b/qa/L0_nan_inf/nan_inf_test.py old mode 100644 new mode 100755 index e68bc664be..3013b03850 --- a/qa/L0_nan_inf/nan_inf_test.py +++ b/qa/L0_nan_inf/nan_inf_test.py @@ -26,45 +26,55 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys -sys.path.append('../common') + +sys.path.append("../common") import json -import unittest import traceback +import unittest -import requests import numpy as np -import tritonclient.http as tritonhttpclient +import requests +import test_util as tu import tritonclient.grpc as tritongrpcclient +import tritonclient.http as tritonhttpclient from tritonclient.utils import InferenceServerException -import test_util as tu + class NanInfTest(tu.TestResultCollector): expected_output = np.array([np.nan, np.inf, np.NINF, 1, 2, 3], dtype=np.float32) model_name = "nan_inf_output" def test_http_raw(self): - payload = {"inputs": [{"name": "INPUT0", "datatype": "FP32", "shape":[1], "data": [1]}]} - response = requests.post("http://localhost:8000/v2/models/nan_inf_output/infer", - data=json.dumps(payload)) + payload = { + "inputs": [ + {"name": "INPUT0", "datatype": "FP32", "shape": [1], "data": [1]} + ] + } + response = requests.post( + "http://localhost:8000/v2/models/nan_inf_output/infer", + data=json.dumps(payload), + ) if not response.ok: self.assertTrue(False, "Response not OK: {}".format(response.text)) try: print(response.json()) except: - self.assertTrue(False, "Response was not valid JSON:\n{}".format(response.text)) + self.assertTrue( + False, "Response was not valid JSON:\n{}".format(response.text) + ) def test_http(self): triton_client = tritonhttpclient.InferenceServerClient("localhost:8000") inputs = [] - inputs.append(tritonhttpclient.InferInput('INPUT0', [1], "FP32")) + inputs.append(tritonhttpclient.InferInput("INPUT0", [1], "FP32")) self.infer_helper(triton_client, inputs) def test_grpc(self): triton_client = tritongrpcclient.InferenceServerClient("localhost:8001") inputs = [] - inputs.append(tritongrpcclient.InferInput('INPUT0', [1], "FP32")) + inputs.append(tritongrpcclient.InferInput("INPUT0", [1], "FP32")) self.infer_helper(triton_client, inputs) def infer_helper(self, triton_client, inputs): @@ -72,16 +82,20 @@ def infer_helper(self, triton_client, inputs): try: results = triton_client.infer(model_name=self.model_name, inputs=inputs) - output0_data = results.as_numpy('OUTPUT0') + output0_data = results.as_numpy("OUTPUT0") # Verify output is as expected # Make sure nan's are equivalent when compared - output_correct = np.array_equal(output0_data, self.expected_output, equal_nan=True) - 
self.assertTrue(output_correct, - "didn't get expected output0: {}".format(output0_data)) + output_correct = np.array_equal( + output0_data, self.expected_output, equal_nan=True + ) + self.assertTrue( + output_correct, "didn't get expected output0: {}".format(output0_data) + ) except InferenceServerException as ex: self.assertTrue(False, ex.message()) except: self.assertTrue(False, traceback.format_exc()) -if __name__ == '__main__': + +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_nullchar_string/nullchar_string_client.py b/qa/L0_nullchar_string/nullchar_string_client.py old mode 100644 new mode 100755 index d90304856d..2d69b41b3d --- a/qa/L0_nullchar_string/nullchar_string_client.py +++ b/qa/L0_nullchar_string/nullchar_string_client.py @@ -26,47 +26,51 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse -import numpy as np +import numpy as np import tritongrpcclient as grpcclient import tritonhttpclient as httpclient from tritonclientutils import np_to_triton_dtype FLAGS = None -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-v', - '--verbose', - action="store_true", - required=False, - default=False, - help='Enable verbose output') - parser.add_argument('-m', - '--model-name', - type=str, - required=True, - help='Name of model') - parser.add_argument('-u', - '--url', - type=str, - required=False, - default='localhost:8000', - help='Inference server URL. Default is localhost:8000.') parser.add_argument( - '-i', - '--protocol', + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-m", "--model-name", type=str, required=True, help="Name of model" + ) + parser.add_argument( + "-u", + "--url", + type=str, + required=False, + default="localhost:8000", + help="Inference server URL. Default is localhost:8000.", + ) + parser.add_argument( + "-i", + "--protocol", type=str, required=False, - default='http', - help='Protocol ("http"/"grpc") used to ' + - 'communicate with inference service. Default is "http".') + default="http", + help='Protocol ("http"/"grpc") used to ' + + 'communicate with inference service. Default is "http".', + ) FLAGS = parser.parse_args() if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"): - print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format( - FLAGS.protocol)) + print( + 'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol) + ) exit(1) client_util = httpclient if FLAGS.protocol == "http" else grpcclient @@ -86,8 +90,9 @@ # Send inference request to the inference server. Get results for # output tensor. inputs = [ - client_util.InferInput("INPUT0", input0_data.shape, - np_to_triton_dtype(np.object_)) + client_util.InferInput( + "INPUT0", input0_data.shape, np_to_triton_dtype(np.object_) + ) ] inputs[0].set_data_from_numpy(input0_data) @@ -95,7 +100,7 @@ # We expect there to be 1 result (with batch-size 1). Compare the input # and output tensor calculated by the model. They must be the same. 
- output0_data = results.as_numpy('OUTPUT0') + output0_data = results.as_numpy("OUTPUT0") print(input0_data, "?=?", output0_data) assert np.equal(input0_data.astype(np.bytes_), output0_data).all() diff --git a/qa/L0_nullchar_string/test.sh b/qa/L0_nullchar_string/test.sh old mode 100644 new mode 100755 index f1c81c9aa6..bded41dc92 --- a/qa/L0_nullchar_string/test.sh +++ b/qa/L0_nullchar_string/test.sh @@ -40,16 +40,22 @@ fi export CUDA_VISIBLE_DEVICES=0 +CLIENT_LOG="./client.log" DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_identity_model_repository +MODELS="graphdef_nobatch_zero_1_object savedmodel_nobatch_zero_1_object" NULLCHAR_CLIENT_PY=nullchar_string_client.py -CLIENT_LOG="./client.log" SERVER=/opt/tritonserver/bin/tritonserver -SERVER_ARGS="--model-repository=$DATADIR" +SERVER_ARGS="--model-repository=models" SERVER_LOG="./inference_server.log" source ../common/util.sh -rm -f $CLIENT_LOG $SERVER_LOG +rm -f $CLIENT_LOG $SERVER_LOG models + +mkdir -p models +for MODEL in $MODELS; do + cp -r $DATADIR/$MODEL models/. +done run_server if [ "$SERVER_PID" == "0" ]; then @@ -65,7 +71,7 @@ set +e # Ignore ONNX backend because even though ONNX supports string data type, # strings that contain null character in the middle is not allowed. # https://github.com/microsoft/onnxruntime/issues/2284 -for MODEL in graphdef_nobatch_zero_1_object savedmodel_nobatch_zero_1_object; do +for MODEL in $MODELS; do python $NULLCHAR_CLIENT_PY -m $MODEL -v >>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then RET=1 diff --git a/qa/L0_onnx_optimization/test.sh b/qa/L0_onnx_optimization/test.sh index 7190f31515..b574f5db32 100755 --- a/qa/L0_onnx_optimization/test.sh +++ b/qa/L0_onnx_optimization/test.sh @@ -61,8 +61,11 @@ for MODEL in \ models/${MODEL}_test && \ rm -fr models/${MODEL}_test/2 && \ rm -fr models/${MODEL}_test/3 && \ + # Set instance count > 1 to test parallel instance loading across all EPs + INSTANCE_COUNT=5 (cd models/${MODEL}_test && \ - sed -i 's/_float32_float32_float32/&_test/' config.pbtxt) && \ + sed -i 's/_float32_float32_float32/&_test/' config.pbtxt && \ + echo -e "\ninstance_group { count: ${INSTANCE_COUNT} }" >> config.pbtxt) && \ # CUDA EP optimization params cp -r models/${MODEL}_test models/${MODEL}_cuda_config && \ (cd models/${MODEL}_cuda_config && \ diff --git a/qa/L0_optional_input/models/ensemble_identity_2_float32/config.pbtxt b/qa/L0_optional_input/models/ensemble_identity_2_float32/config.pbtxt old mode 100755 new mode 100644 diff --git a/qa/L0_optional_input/models/identity_2_float32/config.pbtxt b/qa/L0_optional_input/models/identity_2_float32/config.pbtxt old mode 100755 new mode 100644 diff --git a/qa/L0_optional_input/models/optional_connecting_tensor/config.pbtxt b/qa/L0_optional_input/models/optional_connecting_tensor/config.pbtxt new file mode 100644 index 0000000000..afc4ebc00f --- /dev/null +++ b/qa/L0_optional_input/models/optional_connecting_tensor/config.pbtxt @@ -0,0 +1,98 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +platform: "ensemble" +max_batch_size: 4 +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "optional_identity" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "internal_output0" + } + output_map { + key: "OUTPUT1" + value: "internal_output1" + } + }, + { + model_name: "optional_identity" + model_version: -1 + input_map { + key: "INPUT0" + value: "internal_output0" + } + input_map { + key: "INPUT1" + value: "internal_output1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_optional_input/models/optional_identity/1/model.py b/qa/L0_optional_input/models/optional_identity/1/model.py new file mode 100644 index 0000000000..c736ecc3bd --- /dev/null +++ b/qa/L0_optional_input/models/optional_identity/1/model.py @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def execute(self, requests): + """ + Identity model in Python backend. + """ + responses = [] + for request in requests: + for tidx in ("0", "1"): + input_tensor = pb_utils.get_input_tensor_by_name( + request, "INPUT" + tidx + ) + if input_tensor is not None: + out_tensor = pb_utils.Tensor( + "OUTPUT" + tidx, input_tensor.as_numpy() + ) + responses.append(pb_utils.InferenceResponse([out_tensor])) + return responses diff --git a/qa/L0_optional_input/models/optional_identity/config.pbtxt b/qa/L0_optional_input/models/optional_identity/config.pbtxt new file mode 100644 index 0000000000..0c73fd7ca5 --- /dev/null +++ b/qa/L0_optional_input/models/optional_identity/config.pbtxt @@ -0,0 +1,53 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+backend: "python" +max_batch_size: 4 +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] diff --git a/qa/L0_optional_input/models/pipeline_identity_2_float32/config.pbtxt b/qa/L0_optional_input/models/pipeline_identity_2_float32/config.pbtxt old mode 100755 new mode 100644 diff --git a/qa/L0_optional_input/optional_input_test.py b/qa/L0_optional_input/optional_input_test.py old mode 100644 new mode 100755 index 5143718775..c1fd114d6b --- a/qa/L0_optional_input/optional_input_test.py +++ b/qa/L0_optional_input/optional_input_test.py @@ -1,6 +1,6 @@ #!/usr/bin/python -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -27,16 +27,17 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -import numpy as np import sys -import time import threading +import time import unittest -import tritonclient.grpc as grpcclient -from tritonclient.utils import np_to_triton_dtype + +import numpy as np import test_util as tu +import tritonclient.grpc as grpcclient _deferred_exceptions_lock = threading.Lock() _deferred_exceptions = [] @@ -44,31 +45,30 @@ # Similar set up as dynamic batcher tests class OptionalInputTest(tu.TestResultCollector): - def setUp(self): global _deferred_exceptions _deferred_exceptions = [] # The helper client for setup will be GRPC for simplicity. 
self.triton_client_ = grpcclient.InferenceServerClient("localhost:8001") - self.model_name_ = 'identity_2_float32' + self.model_name_ = "identity_2_float32" # This will not be changed even when ensemble is under test, # as the dynamic batching is performed within the composing model - self.check_status_model = 'identity_2_float32' + self.check_status_model = "identity_2_float32" self.tensor_shape_ = (1, 1) self.inputs_ = { - "INPUT0": grpcclient.InferInput('INPUT0', [1, 1], "FP32"), - "INPUT1": grpcclient.InferInput('INPUT1', [1, 1], "FP32") + "INPUT0": grpcclient.InferInput("INPUT0", [1, 1], "FP32"), + "INPUT1": grpcclient.InferInput("INPUT1", [1, 1], "FP32"), } self.input_data_ = { "INPUT0": np.ones(shape=(1, 1), dtype=np.float32), - "INPUT1": np.zeros(shape=(1, 1), dtype=np.float32) + "INPUT1": np.zeros(shape=(1, 1), dtype=np.float32), } self.inputs_["INPUT0"].set_data_from_numpy(self.input_data_["INPUT0"]) self.inputs_["INPUT1"].set_data_from_numpy(self.input_data_["INPUT1"]) self.outputs_ = { - "INPUT0": grpcclient.InferRequestedOutput('OUTPUT0'), - "INPUT1": grpcclient.InferRequestedOutput('OUTPUT1') + "INPUT0": grpcclient.InferRequestedOutput("OUTPUT0"), + "INPUT1": grpcclient.InferRequestedOutput("OUTPUT1"), } def add_deferred_exception(self, ex): @@ -93,9 +93,9 @@ def check_response(self, thresholds, provided_inputs=("INPUT0", "INPUT1")): outputs.append(self.outputs_[provided_input]) triton_client = grpcclient.InferenceServerClient("localhost:8001") - results = triton_client.infer(model_name=self.model_name_, - inputs=inputs, - outputs=outputs) + results = triton_client.infer( + model_name=self.model_name_, inputs=inputs, outputs=outputs + ) end_ms = int(round(time.time() * 1000)) @@ -106,66 +106,103 @@ def check_response(self, thresholds, provided_inputs=("INPUT0", "INPUT1")): self.assertTrue( np.array_equal(output_data, expected), "{}, {}, expected: {}, got {}".format( - self.model_name_, output_name, expected, output_data)) + self.model_name_, output_name, expected, output_data + ), + ) gt_ms = thresholds[0] lt_ms = thresholds[1] if lt_ms is not None: self.assertTrue( (end_ms - start_ms) < lt_ms, - "expected less than " + str(lt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected less than " + + str(lt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) if gt_ms is not None: self.assertTrue( (end_ms - start_ms) > gt_ms, - "expected greater than " + str(gt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected greater than " + + str(gt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) except Exception as ex: self.add_deferred_exception(ex) def check_status(self, model_name, batch_exec, request_cnt, infer_cnt): - stats = self.triton_client_.get_inference_statistics(model_name, "1") - self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") - self.assertEqual(stats.model_stats[0].name, model_name, - "expect model stats for model {}".format(model_name)) + # There is a time window between when responses are returned and statistics are updated. + # To prevent intermittent test failure during that window, wait up to 10 seconds for the + # inference statistics to be ready. 
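# --- Editor's note: a hypothetical, generalized form of the retry loop the patch
# --- adds just below; the helper name and defaults are illustrative, not from the patch.
import time

def wait_for_inference_stats(client, model_name, attempts=10, delay_sec=1):
    """Poll get_inference_statistics until an execution has been recorded."""
    stats = client.get_inference_statistics(model_name, "1")
    for _ in range(attempts):
        if stats.model_stats and stats.model_stats[0].execution_count > 0:
            break
        time.sleep(delay_sec)
        stats = client.get_inference_statistics(model_name, "1")
    return stats  # callers assert on the returned statistics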
+ num_tries = 10 + for i in range(num_tries): + stats = self.triton_client_.get_inference_statistics(model_name, "1") + self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") + actual_exec_cnt = stats.model_stats[0].execution_count + if stats.model_stats[0].execution_count > 0: + break + time.sleep(1) + self.assertEqual( - stats.model_stats[0].version, "1", - "expect model stats for model {} version 1".format(model_name)) + stats.model_stats[0].name, + model_name, + "expect model stats for model {}".format(model_name), + ) + self.assertEqual( + stats.model_stats[0].version, + "1", + "expect model stats for model {} version 1".format(model_name), + ) batch_stats = stats.model_stats[0].batch_stats self.assertEqual( - len(batch_stats), len(batch_exec), + len(batch_stats), + len(batch_exec), "expected {} different batch-sizes, got {}".format( - len(batch_exec), len(batch_stats))) + len(batch_exec), len(batch_stats) + ), + ) for batch_stat in batch_stats: bs = batch_stat.batch_size bc = batch_stat.compute_infer.count - self.assertTrue(bs in batch_exec, - "unexpected batch-size {}".format(bs)) + self.assertTrue(bs in batch_exec, "unexpected batch-size {}".format(bs)) # Get count from one of the stats self.assertEqual( - bc, batch_exec[bs], - "expected model-execution-count {} for batch size {}, got {}". - format(batch_exec[bs], bs, bc)) + bc, + batch_exec[bs], + "expected model-execution-count {} for batch size {}, got {}".format( + batch_exec[bs], bs, bc + ), + ) actual_request_cnt = stats.model_stats[0].inference_stats.success.count self.assertEqual( - actual_request_cnt, request_cnt, + actual_request_cnt, + request_cnt, "expected model-request-count {}, got {}".format( - request_cnt, actual_request_cnt)) + request_cnt, actual_request_cnt + ), + ) actual_exec_cnt = stats.model_stats[0].execution_count self.assertEqual( - actual_request_cnt, request_cnt, - "expected model-exec-count {}, got {}".format( - request_cnt, actual_exec_cnt)) + actual_request_cnt, + request_cnt, + "expected model-exec-count {}, got {}".format(request_cnt, actual_exec_cnt), + ) actual_infer_cnt = stats.model_stats[0].inference_count self.assertEqual( - actual_infer_cnt, infer_cnt, + actual_infer_cnt, + infer_cnt, "expected model-inference-count {}, got {}".format( - infer_cnt, actual_infer_cnt)) + infer_cnt, actual_infer_cnt + ), + ) def test_all_inputs(self): # Provide all inputs, send requests that don't form preferred batch @@ -173,11 +210,11 @@ def test_all_inputs(self): try: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=((4000, None),))) + threading.Thread(target=self.check_response, args=((4000, None),)) + ) threads.append( - threading.Thread(target=self.check_response, - args=((4000, None),))) + threading.Thread(target=self.check_response, args=((4000, None),)) + ) threads[0].start() threads[1].start() for t in threads: @@ -194,13 +231,19 @@ def test_optional_same_input(self): try: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=((4000, None),), - kwargs={'provided_inputs': ("INPUT1",)})) + threading.Thread( + target=self.check_response, + args=((4000, None),), + kwargs={"provided_inputs": ("INPUT1",)}, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=((4000, None),), - kwargs={'provided_inputs': ("INPUT1",)})) + threading.Thread( + target=self.check_response, + args=((4000, None),), + kwargs={"provided_inputs": ("INPUT1",)}, + ) + ) threads[0].start() threads[1].start() for t in threads: @@ -218,22 
+261,34 @@ def test_optional_mix_inputs(self): try: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=((0, 4000),), - kwargs={'provided_inputs': ("INPUT0",)})) + threading.Thread( + target=self.check_response, + args=((0, 4000),), + kwargs={"provided_inputs": ("INPUT0",)}, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=((0, 4000),), - kwargs={'provided_inputs': ("INPUT1",)})) + threading.Thread( + target=self.check_response, + args=((0, 4000),), + kwargs={"provided_inputs": ("INPUT1",)}, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=((0, 4000),), - kwargs={'provided_inputs': ("INPUT0",)})) + threading.Thread( + target=self.check_response, + args=((0, 4000),), + kwargs={"provided_inputs": ("INPUT0",)}, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=((4000, None),), - kwargs={'provided_inputs': ("INPUT1",)})) + threading.Thread( + target=self.check_response, + args=((4000, None),), + kwargs={"provided_inputs": ("INPUT1",)}, + ) + ) for t in threads: t.start() time.sleep(0.5) @@ -253,19 +308,26 @@ def test_optional_mix_inputs_2(self): try: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=((0, 4000),), - kwargs={'provided_inputs': ("INPUT0",)})) + threading.Thread( + target=self.check_response, + args=((0, 4000),), + kwargs={"provided_inputs": ("INPUT0",)}, + ) + ) threads.append( - threading.Thread(target=self.check_response, args=((0, 4000),))) + threading.Thread(target=self.check_response, args=((0, 4000),)) + ) threads.append( - threading.Thread(target=self.check_response, - args=((0, 4000),), - kwargs={'provided_inputs': ("INPUT0",)})) + threading.Thread( + target=self.check_response, + args=((0, 4000),), + kwargs={"provided_inputs": ("INPUT0",)}, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=((4000, None),))) + threading.Thread(target=self.check_response, args=((4000, None),)) + ) for t in threads: t.start() time.sleep(0.5) @@ -279,28 +341,28 @@ def test_optional_mix_inputs_2(self): def test_ensemble_all_inputs(self): # The ensemble is only a wrapper over 'identity_2_float32' - self.model_name_ = 'ensemble_identity_2_float32' + self.model_name_ = "ensemble_identity_2_float32" self.test_all_inputs() # From the ensemble's perspective, the requests are processed as it is self.check_status(self.model_name_, {1: 2}, 2, 2) def test_ensemble_optional_same_input(self): # The ensemble is only a wrapper over 'identity_2_float32' - self.model_name_ = 'ensemble_identity_2_float32' + self.model_name_ = "ensemble_identity_2_float32" self.test_optional_same_input() # From the ensemble's perspective, the requests are processed as it is self.check_status(self.model_name_, {1: 2}, 2, 2) def test_ensemble_optional_mix_inputs(self): # The ensemble is only a wrapper over 'identity_2_float32' - self.model_name_ = 'ensemble_identity_2_float32' + self.model_name_ = "ensemble_identity_2_float32" self.test_optional_mix_inputs() # From the ensemble's perspective, the requests are processed as it is self.check_status(self.model_name_, {1: 4}, 4, 4) def test_ensemble_optional_mix_inputs_2(self): # The ensemble is only a wrapper over 'identity_2_float32' - self.model_name_ = 'ensemble_identity_2_float32' + self.model_name_ = "ensemble_identity_2_float32" self.test_optional_mix_inputs_2() # From the ensemble's perspective, the requests are processed as it is self.check_status(self.model_name_, {1: 4}, 4, 4) @@ -310,7 +372,7 @@ 
def test_ensemble_optional_pipeline(self): # inputs, where the ensemble step only connects a subset of inputs # for the second model (which is valid because the disconnected inputs # are marked optional). See 'config.pbtxt' for detail. - self.model_name_ = 'pipeline_identity_2_float32' + self.model_name_ = "pipeline_identity_2_float32" # Provide all inputs, send requests that don't form preferred batch # so all requests should be returned after the queue delay @@ -321,28 +383,63 @@ def test_ensemble_optional_pipeline(self): inputs.append(self.inputs_[provided_input]) triton_client = grpcclient.InferenceServerClient("localhost:8001") - results = triton_client.infer(model_name=self.model_name_, - inputs=inputs) + results = triton_client.infer(model_name=self.model_name_, inputs=inputs) # OUTPU0 is always zero, OUTPUT1 = INPUT0 output_data = results.as_numpy("OUTPUT0") expected = np.zeros(shape=(1, 1), dtype=np.float32) self.assertTrue( np.array_equal(output_data, expected), - "{}, {}, expected: {}, got {}".format(self.model_name_, - "OUTPUT0", expected, - output_data)) + "{}, {}, expected: {}, got {}".format( + self.model_name_, "OUTPUT0", expected, output_data + ), + ) expected = self.input_data_["INPUT0"] output_data = results.as_numpy("OUTPUT1") self.assertTrue( np.array_equal(output_data, expected), - "{}, {}, expected: {}, got {}".format(self.model_name_, - "OUTPUT1", expected, - output_data)) + "{}, {}, expected: {}, got {}".format( + self.model_name_, "OUTPUT1", expected, output_data + ), + ) + except Exception as ex: + self.assertTrue(False, "unexpected error {}".format(ex)) + + def test_ensemble_optional_connecting_tensor(self): + # The ensemble is a special case of pipelining models with optional + # inputs, where the request will only produce a subset of inputs + # for the second model while the ensemble graph connects all inputs of + # the second model (which is valid because the not-provided inputs + # are marked optional). See 'config.pbtxt' for detail. 
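# --- Editor's note: a tiny stand-alone simulation of the data flow described in the
# --- comment above (only INPUT0 provided, two chained optional_identity steps).
# --- Purely illustrative; these functions are not part of the patch or the Triton API.
def optional_identity(tensors):
    # Echo whichever optional inputs are present, INPUTn -> OUTPUTn.
    return {"OUTPUT" + name[-1]: value for name, value in tensors.items()}

step1 = optional_identity({"INPUT0": 1.0})               # -> {"OUTPUT0": 1.0}
step2 = optional_identity({"INPUT0": step1["OUTPUT0"]})  # internal_output0 feeds INPUT0
assert step2 == {"OUTPUT0": 1.0}                         # OUTPUT1 is simply never produced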
+ self.model_name_ = "optional_connecting_tensor" + + # Provide all inputs, send requests that don't form preferred batch + # so all requests should be returned after the queue delay + try: + provided_inputs = ("INPUT0",) + inputs = [] + outputs = [] + for provided_input in provided_inputs: + inputs.append(self.inputs_[provided_input]) + outputs.append(self.outputs_[provided_input]) + + triton_client = grpcclient.InferenceServerClient("localhost:8001") + results = triton_client.infer( + model_name=self.model_name_, inputs=inputs, outputs=outputs + ) + + expected = self.input_data_["INPUT0"] + output_data = results.as_numpy("OUTPUT0") + self.assertTrue( + np.array_equal(output_data, expected), + "{}, {}, expected: {}, got {}".format( + self.model_name_, "OUTPUT0", expected, output_data + ), + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_optional_input/test.sh b/qa/L0_optional_input/test.sh index 351be38d4d..8bfd113d32 100755 --- a/qa/L0_optional_input/test.sh +++ b/qa/L0_optional_input/test.sh @@ -41,6 +41,7 @@ rm -fr *.log mkdir -p ./models/identity_2_float32/1 mkdir -p ./models/ensemble_identity_2_float32/1 mkdir -p ./models/pipeline_identity_2_float32/1 +mkdir -p ./models/optional_connecting_tensor/1 # Basic test cases TEST_CASES=${TEST_CASES:="test_all_inputs \ @@ -51,8 +52,9 @@ TEST_CASES=${TEST_CASES:="test_all_inputs \ test_ensemble_optional_same_input \ test_ensemble_optional_mix_inputs \ test_ensemble_optional_mix_inputs_2 \ - test_ensemble_optional_pipeline"} - + test_ensemble_optional_pipeline \ + test_ensemble_optional_connecting_tensor"} +RET=0 for i in $TEST_CASES ; do # Restart server for every test to clear model stats run_server @@ -62,8 +64,6 @@ for i in $TEST_CASES ; do exit 1 fi - RET=0 - echo "Test: $i" >>$TEST_LOG set +e diff --git a/qa/L0_output_name/output_name_test.py b/qa/L0_output_name/output_name_test.py old mode 100644 new mode 100755 index e5efdaddc6..905174640c --- a/qa/L0_output_name/output_name_test.py +++ b/qa/L0_output_name/output_name_test.py @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,26 +26,20 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") -import argparse -import numpy as np -import os -from builtins import range -from functools import partial -from PIL import Image import unittest + import test_util as tu +from tritongrpcclient import grpc_service_pb2, grpc_service_pb2_grpc import grpc -from tritongrpcclient import grpc_service_pb2 -from tritongrpcclient import grpc_service_pb2_grpc _trials = ("graphdef", "libtorch", "onnx", "plan", "savedmodel") class OutputNameValidationTest(tu.TestResultCollector): - def requestGenerator(self, model_name, output_name): request = grpc_service_pb2.ModelInferRequest() request.model_name = model_name @@ -58,12 +52,11 @@ def requestGenerator(self, model_name, output_name): request.inputs.extend([input]) - output = grpc_service_pb2.ModelInferRequest( - ).InferRequestedOutputTensor() + output = grpc_service_pb2.ModelInferRequest().InferRequestedOutputTensor() output.name = output_name request.outputs.extend([output]) - request.raw_input_contents.extend([bytes(4 * 'a', 'utf-8')]) + request.raw_input_contents.extend([bytes(4 * "a", "utf-8")]) return request @@ -78,14 +71,14 @@ def test_grpc(self): try: response = grpc_stub.ModelInfer(request) self.assertTrue( - False, - "unexpected success for unknown output " + model_name) + False, "unexpected success for unknown output " + model_name + ) except grpc.RpcError as rpc_error: msg = rpc_error.details() self.assertTrue( - msg.startswith( - "unexpected inference output 'DUMMY' for model")) + msg.startswith("unexpected inference output 'DUMMY' for model") + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_output_name/test.sh b/qa/L0_output_name/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_output_validation/lt_op_val_client.py b/qa/L0_output_validation/lt_op_val_client.py old mode 100644 new mode 100755 index 7647497fff..77b5a16e3f --- a/qa/L0_output_validation/lt_op_val_client.py +++ b/qa/L0_output_validation/lt_op_val_client.py @@ -27,43 +27,47 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -import requests import unittest + +import requests import test_util as tu class OutputValidationTest(tu.TestResultCollector): # for datatype mismatch def test_datatype(self): - url = 'http://localhost:8000/v2/models/libtorch_datatype_1_float32/infer' + url = "http://localhost:8000/v2/models/libtorch_datatype_1_float32/infer" body = '{"inputs":[{"name":"INPUT__0","shape":[1,1],"datatype":"FP32","data":[1.0]}],"outputs":[{"name":"OUTPUT__0"}]}' response = requests.post(url, data=body) msg = response.json()["error"] self.assertTrue( msg.startswith( "configuration expects datatype TYPE_INT32 for output 'OUTPUT__0', model provides TYPE_FP32" - )) + ) + ) # for output mismatch def test_index(self): - url = 'http://localhost:8000/v2/models/libtorch_index_1_float32/infer' + url = "http://localhost:8000/v2/models/libtorch_index_1_float32/infer" body = '{"inputs":[{"name":"INPUT__0","shape":[1,1],"datatype":"FP32","data":[1.0]}],"outputs":[{"name":"OUTPUT__1"}]}' response = requests.post(url, data=body) msg = response.json()["error"] self.assertTrue( msg.startswith( "The output OUTPUT__1 in the model configuration refers to an output index which doesn't exist. 
This model has 1 outputs" - )) + ) + ) # successful run def test_success(self): - url = 'http://localhost:8000/v2/models/libtorch_zero_1_float32/infer' + url = "http://localhost:8000/v2/models/libtorch_zero_1_float32/infer" body = '{"inputs":[{"name":"INPUT__0","shape":[1,1],"datatype":"FP32","data":[1.0]}],"outputs":[{"name":"OUTPUT__0"}]}' response = requests.post(url, data=body) self.assertEqual(response.status_code, 200) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_output_validation/test.sh b/qa/L0_output_validation/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_parallel_copy/parallel_copy_test.py b/qa/L0_parallel_copy/parallel_copy_test.py old mode 100644 new mode 100755 index c9b958f5ed..6748fee006 --- a/qa/L0_parallel_copy/parallel_copy_test.py +++ b/qa/L0_parallel_copy/parallel_copy_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,35 +27,39 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -from builtins import range +import functools import time import unittest +from builtins import range + import numpy as np import test_util as tu -import functools import tritonclient.grpc as grpcclient from tritonclient.utils import InferenceServerException class ParallelCopyTest(tu.TestResultCollector): - def setUp(self): self.client_ = grpcclient.InferenceServerClient("localhost:8001") self.dtype_ = np.float32 - self.model_name_ = tu.get_zero_model_name('plan', 1, self.dtype_) + self.model_name_ = tu.get_zero_model_name("plan", 1, self.dtype_) def _batch_input_duration(self, batch_size): stats = self.client_.get_inference_statistics(self.model_name_, "1") self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") self.assertEqual( - stats.model_stats[0].name, self.model_name_, - "expect model stats for model {}".format(self.model_name_)) + stats.model_stats[0].name, + self.model_name_, + "expect model stats for model {}".format(self.model_name_), + ) self.assertEqual( - stats.model_stats[0].version, "1", - "expect model stats for model {} version 1".format( - self.model_name_)) + stats.model_stats[0].version, + "1", + "expect model stats for model {} version 1".format(self.model_name_), + ) batch_stats = stats.model_stats[0].batch_stats @@ -69,10 +75,11 @@ def _run(self, batch_sizes): np.random.random([bs, 16 * 1024 * 1024]).astype(self.dtype_) for bs in batch_sizes ] - inputs = [[ - grpcclient.InferInput('INPUT0', [bs, 16 * 1024 * 1024], "FP32") - ] for bs in batch_sizes] - output = [grpcclient.InferRequestedOutput('OUTPUT0')] + inputs = [ + [grpcclient.InferInput("INPUT0", [bs, 16 * 1024 * 1024], "FP32")] + for bs in batch_sizes + ] + output = [grpcclient.InferRequestedOutput("OUTPUT0")] for idx in range(len(inputs)): inputs[idx][0].set_data_from_numpy(input_data[idx]) @@ -88,11 +95,12 @@ def callback(user_data, idx, result, error): before_compute_input_duration = self._batch_input_duration(batch_size) for idx in range(len(batch_sizes)): - self.client_.async_infer(model_name=self.model_name_, - inputs=inputs[idx], - callback=functools.partial( - callback, user_data, idx), - outputs=output) + self.client_.async_infer( + model_name=self.model_name_, + inputs=inputs[idx], + 
callback=functools.partial(callback, user_data, idx), + outputs=output, + ) # Wait until the results are available in user_data time_out = 20 @@ -107,19 +115,24 @@ def callback(user_data, idx, result, error): time_out = time_out - 1 time.sleep(1) done_cnt = functools.reduce( - lambda dc, x: dc + 1 if x is not None else dc, user_data, 0) + lambda dc, x: dc + 1 if x is not None else dc, user_data, 0 + ) self.assertEqual( - done_cnt, len(batch_sizes), - "expected {} responses, got {}".format(len(batch_sizes), done_cnt)) + done_cnt, + len(batch_sizes), + "expected {} responses, got {}".format(len(batch_sizes), done_cnt), + ) for idx in range(len(batch_sizes)): res = user_data[idx] self.assertFalse( type(res) == InferenceServerException, - "expected response for request {}, got exception {}".format( - idx, res)) - output_data = res.as_numpy('OUTPUT0') - self.assertTrue(np.array_equal(output_data, input_data[idx]), - "Mismatched output data for request {}".format(idx)) + "expected response for request {}, got exception {}".format(idx, res), + ) + output_data = res.as_numpy("OUTPUT0") + self.assertTrue( + np.array_equal(output_data, input_data[idx]), + "Mismatched output data for request {}".format(idx), + ) after_compute_input_duration = self._batch_input_duration(batch_size) return after_compute_input_duration - before_compute_input_duration @@ -134,13 +147,17 @@ def test_performance(self): # The following check is loose, local runs show that the speedup is not # significant (~15%), may be due to the dispatch overhead - # which cancels part of the improvment + # which cancels part of the improvement self.assertTrue( serialized_time > parallelized_time, - "Expected parallelized copy is faster than serialized copy") - print("serialized v.s. parallelized : {} v.s. {}".format( - serialized_time, parallelized_time)) + "Expected parallelized copy is faster than serialized copy", + ) + print( + "serialized v.s. parallelized : {} v.s. {}".format( + serialized_time, parallelized_time + ) + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_parameters/model_repository/ensemble/config.pbtxt b/qa/L0_parameters/model_repository/ensemble/config.pbtxt new file mode 100644 index 0000000000..383d89c9f6 --- /dev/null +++ b/qa/L0_parameters/model_repository/ensemble/config.pbtxt @@ -0,0 +1,68 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +platform: "ensemble" +max_batch_size: 0 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +output [ + { + name: "key" + data_type: TYPE_STRING + dims: [ -1 ] + }, + { + name: "value" + data_type: TYPE_STRING + dims: [ -1 ] + } +] + +ensemble_scheduling +{ + step [ + { + model_name: "identity" + model_version: -1 + input_map { key: "INPUT0", value: "INPUT0" } + output_map { key: "OUTPUT0", value: "OUTPUT0" } + }, + { + model_name: "parameter" + model_version: -1 + input_map { key: "INPUT0", value: "OUTPUT0" } + output_map { key: "key", value: "key" } + output_map { key: "value", value: "value" } + } + ] +} diff --git a/qa/L0_parameters/model_repository/identity/config.pbtxt b/qa/L0_parameters/model_repository/identity/config.pbtxt new file mode 100644 index 0000000000..8908845574 --- /dev/null +++ b/qa/L0_parameters/model_repository/identity/config.pbtxt @@ -0,0 +1,44 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "identity" +max_batch_size: 0 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] diff --git a/qa/L0_parameters/model_repository/parameter/1/model.py b/qa/L0_parameters/model_repository/parameter/1/model.py new file mode 100644 index 0000000000..c175860962 --- /dev/null +++ b/qa/L0_parameters/model_repository/parameter/1/model.py @@ -0,0 +1,77 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + inputs = [{"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [1]}] + outputs = [ + {"name": "key", "data_type": "TYPE_STRING", "dims": [-1]}, + {"name": "value", "data_type": "TYPE_STRING", "dims": [-1]}, + ] + + config = auto_complete_model_config.as_dict() + input_names = [] + output_names = [] + for input in config["input"]: + input_names.append(input["name"]) + for output in config["output"]: + output_names.append(output["name"]) + + for input in inputs: + if input["name"] not in input_names: + auto_complete_model_config.add_input(input) + for output in outputs: + if output["name"] not in output_names: + auto_complete_model_config.add_output(output) + + auto_complete_model_config.set_max_batch_size(0) + return auto_complete_model_config + + def execute(self, requests): + # A simple model that puts the request parameters into the outputs. + responses = [] + for request in requests: + parameters = json.loads(request.parameters()) + keys = [] + values = [] + for key, value in parameters.items(): + keys.append(key) + values.append(value) + key_output = pb_utils.Tensor("key", np.asarray(keys, dtype=object)) + value_output = pb_utils.Tensor("value", np.asarray(values, dtype=object)) + inference_response = pb_utils.InferenceResponse( + output_tensors=[key_output, value_output] + ) + responses.append(inference_response) + + return responses diff --git a/qa/L0_parameters/parameters_test.py b/qa/L0_parameters/parameters_test.py new file mode 100755 index 0000000000..959f0fc5dc --- /dev/null +++ b/qa/L0_parameters/parameters_test.py @@ -0,0 +1,223 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import os +import queue +import unittest +from functools import partial +from unittest import IsolatedAsyncioTestCase + +import numpy as np +import tritonclient.grpc as grpcclient +import tritonclient.grpc.aio as asyncgrpcclient +import tritonclient.http as httpclient +import tritonclient.http.aio as asynchttpclient +from tritonclient.utils import InferenceServerException + +TEST_HEADER = os.environ.get("TEST_HEADER") + + +class InferenceParametersTest(IsolatedAsyncioTestCase): + async def asyncSetUp(self): + self.http = httpclient.InferenceServerClient(url="localhost:8000") + self.async_http = asynchttpclient.InferenceServerClient(url="localhost:8000") + self.grpc = grpcclient.InferenceServerClient(url="localhost:8001") + self.async_grpc = asyncgrpcclient.InferenceServerClient(url="localhost:8001") + + self.parameter_list = [] + self.parameter_list.append({"key1": "value1", "key2": "value2"}) + self.parameter_list.append({"key1": 1, "key2": 2}) + self.parameter_list.append({"key1": True, "key2": "value2"}) + self.parameter_list.append({"triton_": True, "key2": "value2"}) + + if TEST_HEADER == "1": + self.headers = { + "header_1": "value_1", + "header_2": "value_2", + "my_header_1": "my_value_1", + "my_header_2": "my_value_2", + "my_header_3": 'This is a "quoted" string with a backslash\ ', + } + + # only these headers should be forwarded to the model. 
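# --- Editor's note: a minimal sketch of how the expected subset below relates to the
# --- full header dict above; the pattern mirrors the my_header.* value passed to
# --- --http/--grpc-header-forward-pattern later in test.sh (illustrative only).
import re

forward_pattern = re.compile(r"my_header.*")
all_headers = {"header_1": "value_1", "my_header_1": "my_value_1"}
forwarded = {k: v for k, v in all_headers.items() if forward_pattern.match(k)}
assert forwarded == {"my_header_1": "my_value_1"}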
+ self.expected_headers = { + "my_header_1": "my_value_1", + "my_header_2": "my_value_2", + "my_header_3": 'This is a "quoted" string with a backslash\ ', + } + else: + self.headers = {} + self.expected_headers = {} + + def callback(user_data, result, error): + if error: + user_data.put(error) + else: + user_data.put(result) + + self.grpc_callback = callback + + def create_inputs(self, client_type): + inputs = [] + inputs.append(client_type.InferInput("INPUT0", [1], "FP32")) + + # Initialize the data + inputs[0].set_data_from_numpy(np.asarray([1], dtype=np.float32)) + return inputs + + async def send_request_and_verify( + self, client_type, client, is_async=False, model_name="parameter" + ): + inputs = self.create_inputs(client_type) + for parameters in self.parameter_list: + # Setup infer callable to re-use below for brevity + infer_callable = partial( + client.infer, + model_name=model_name, + inputs=inputs, + parameters=parameters, + headers=self.headers, + ) + + # The `triton_` prefix is reserved for Triton usage + should_error = False + if "triton_" in parameters.keys(): + should_error = True + + if is_async: + if should_error: + with self.assertRaises(InferenceServerException): + await infer_callable() + return + else: + result = await infer_callable() + else: + if should_error: + with self.assertRaises(InferenceServerException): + infer_callable() + return + else: + result = infer_callable() + + self.verify_outputs(result, parameters) + + def verify_outputs(self, result, parameters): + keys = result.as_numpy("key") + values = result.as_numpy("value") + keys = keys.astype(str).tolist() + expected_keys = list(parameters.keys()) + list(self.expected_headers.keys()) + self.assertEqual(set(keys), set(expected_keys)) + + # We have to convert the parameter values to string + expected_values = [] + for expected_value in list(parameters.values()): + expected_values.append(str(expected_value)) + for value in self.expected_headers.values(): + expected_values.append(value) + self.assertEqual(set(values.astype(str).tolist()), set(expected_values)) + + async def test_grpc_parameter(self): + await self.send_request_and_verify(grpcclient, self.grpc) + + async def test_http_parameter(self): + await self.send_request_and_verify(httpclient, self.http) + + async def test_async_http_parameter(self): + await self.send_request_and_verify( + asynchttpclient, self.async_http, is_async=True + ) + + async def test_async_grpc_parameter(self): + await self.send_request_and_verify( + asyncgrpcclient, self.async_grpc, is_async=True + ) + + def test_http_async_parameter(self): + inputs = self.create_inputs(httpclient) + # Skip the parameter that returns an error + parameter_list = self.parameter_list[:-1] + for parameters in parameter_list: + result = self.http.async_infer( + model_name="parameter", + inputs=inputs, + parameters=parameters, + headers=self.headers, + ).get_result() + self.verify_outputs(result, parameters) + + def test_grpc_async_parameter(self): + user_data = queue.Queue() + inputs = self.create_inputs(grpcclient) + # Skip the parameter that returns an error + parameter_list = self.parameter_list[:-1] + for parameters in parameter_list: + self.grpc.async_infer( + model_name="parameter", + inputs=inputs, + parameters=parameters, + headers=self.headers, + callback=partial(self.grpc_callback, user_data), + ) + result = user_data.get() + self.assertFalse(result is InferenceServerException) + self.verify_outputs(result, parameters) + + def test_grpc_stream_parameter(self): + user_data = queue.Queue() + 
self.grpc.start_stream( + callback=partial(self.grpc_callback, user_data), headers=self.headers + ) + inputs = self.create_inputs(grpcclient) + # Skip the parameter that returns an error + parameter_list = self.parameter_list[:-1] + for parameters in parameter_list: + # async stream infer + self.grpc.async_stream_infer( + model_name="parameter", inputs=inputs, parameters=parameters + ) + result = user_data.get() + self.assertFalse(result is InferenceServerException) + self.verify_outputs(result, parameters) + self.grpc.stop_stream() + + async def test_ensemble_parameter_forwarding(self): + await self.send_request_and_verify(httpclient, self.http, model_name="ensemble") + + async def asyncTearDown(self): + self.http.close() + self.grpc.close() + await self.async_grpc.close() + await self.async_http.close() + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_parameters/test.sh b/qa/L0_parameters/test.sh new file mode 100755 index 0000000000..967ead15c7 --- /dev/null +++ b/qa/L0_parameters/test.sh @@ -0,0 +1,95 @@ +#!/bin/bash +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! 
-z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +CLIENT_LOG="./client.log" +TEST_SCRIPT_PY="parameters_test.py" + +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_LOG="./inference_server.log" +source ../common/util.sh + +MODELDIR="model_repository" +# Use identity model as dummy step to ensure parameters pass through each step +mkdir -p "${MODELDIR}/identity/1" +mkdir -p "${MODELDIR}/ensemble/1" + +# TODO: Add support and testing for C++ client parameters: +# https://jirasw.nvidia.com/browse/DLIS-4673 + +RET=0 +for i in {0..1}; do + + # TEST_HEADER is a parameter used by `parameters_test.py` that controls + # whether the script will test for inclusion of headers in parameters or not. + if [ $i == 1 ]; then + SERVER_ARGS="--model-repository=${MODELDIR} --exit-timeout-secs=120 --grpc-header-forward-pattern my_header.* --http-header-forward-pattern my_header.*" + else + SERVER_ARGS="--model-repository=${MODELDIR} --exit-timeout-secs=120" + fi + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + set +e + TEST_HEADER=$i python3 $TEST_SCRIPT_PY >$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + fi + + set -e + + kill $SERVER_PID + wait $SERVER_PID +done + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + cat $CLIENT_LOG + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET + diff --git a/qa/L0_passive_instance/models/distributed_int32_int32_int32/config.pbtxt b/qa/L0_passive_instance/models/distributed_int32_int32_int32/config.pbtxt old mode 100755 new mode 100644 diff --git a/qa/L0_passive_instance/passive_instance_test.py b/qa/L0_passive_instance/passive_instance_test.py old mode 100644 new mode 100755 index 38a2724f6e..d7cdfffa7b --- a/qa/L0_passive_instance/passive_instance_test.py +++ b/qa/L0_passive_instance/passive_instance_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,24 +27,25 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") import unittest -import numpy as np + import infer_util as iu +import numpy as np import test_util as tu -import tritonclient.http as httpclient class PassiveInstanceTest(tu.TestResultCollector): - def test_inference(self): try: - iu.infer_exact(self, "distributed", (1, 16), 1, np.int32, np.int32, - np.int32) + iu.infer_exact( + self, "distributed", (1, 16), 1, np.int32, np.int32, np.int32 + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_passive_instance/test.sh b/qa/L0_passive_instance/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_perf_analyzer/perf_analyzer_profile_export_schema.json b/qa/L0_perf_analyzer/perf_analyzer_profile_export_schema.json new file mode 100644 index 0000000000..d0feacd9b4 --- /dev/null +++ b/qa/L0_perf_analyzer/perf_analyzer_profile_export_schema.json @@ -0,0 +1,95 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "$id": "https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/examples/schema.json", + "title": "Perf Analyzer output data", + "description": "A json file describing the output from a Perf Analyzer run.", + "type": "object", + "required": [ + "experiments", + "version" + ], + "properties": { + "experiments": { + "description": "The array of all experiments run by Perf Analyzer.", + "type": "array", + "required": [ + "experiment", + "requests", + "window_boundaries" + ], + "minItems": 1, + "uniqueItems": true, + "items": { + "type": "object", + "properties": { + "experiment": { + "description": "A single experiment run by Perf Analyzer.", + "type": "object", + "required": [ + "mode", + "value" + ], + "minItems": 1, + "maxItems": 1, + "properties": { + "mode": { + "description": "Operating mode of Perf Analyzer: For example, 'concurrency' or 'request rate'.", + "type": "string" + }, + "value": { + "description": "Concurrency or request rate for the current experiment.", + "type": "integer" + } + } + }, + "requests": { + "description": "The array of requests sent by Perf Analyzer for this experiment.", + "type": "array", + "items": { + "$ref": "#/properties/experiments/items/properties/$defs/request" + } + }, + "$defs": { + "request": { + "description": "Info for a single request.", + "type": "object", + "required": [ + "timestamp", + "response_timestamps" + ], + "properties": { + "timestamp": { + "description": "Time stamp of the request.", + "type": "integer" + }, + "sequence_id": { + "description": "The sequence_id of the request.", + "type": "integer" + }, + "response_timestamps": { + "description": "All associated responses to this request.", + "type": "array", + "items": { + "type": "integer" + } + } + } + } + }, + "window_boundaries": { + "description": "An array of time stamps describing window boundaries.", + "type": "array", + "items": { + "type": "integer" + }, + "uniqueItems": true + } + } + } + }, + "version": { + "description": "The version of Perf Analyzer that generated the report.", + "type": "string" + } + } +} \ No newline at end of file diff --git a/qa/L0_perf_analyzer/test.sh b/qa/L0_perf_analyzer/test.sh index 4c2b7244a2..20a659da85 100755 --- a/qa/L0_perf_analyzer/test.sh +++ b/qa/L0_perf_analyzer/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -48,6 +48,7 @@ TESTDATADIR=`pwd`/test_data INT_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/int_data.json INT_DIFFSHAPE_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/int_data_diff_shape.json +INT_OPTIONAL_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/int_data_optional.json FLOAT_DIFFSHAPE_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/float_data_with_shape.json STRING_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/string_data.json STRING_WITHSHAPE_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/string_data_with_shape.json @@ -63,12 +64,16 @@ WRONG_OUTPUT_2_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/wrong_ SEQ_OUTPUT_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/seq_output.json SEQ_WRONG_OUTPUT_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/seq_wrong_output.json +REPEAT_INT32_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/repeat_int32_data.json + SERVER=/opt/tritonserver/bin/tritonserver SERVER_ARGS="--model-repository=${DATADIR}" SERVER_LOG="./inference_server.log" ERROR_STRING="error | Request count: 0 | : 0 infer/sec" +STABILITY_THRESHOLD="100" + source ../common/util.sh rm -f $SERVER_LOG $CLIENT_LOG @@ -112,6 +117,18 @@ cp -r ../custom_models/custom_zero_1_float32 $DATADIR && \ echo "{ key: \"execute_delay_ms\"; value: { string_value: \"100\" }}" >> config.pbtxt && \ echo "]" >> config.pbtxt) +# Copy and customize optional inputs model +cp -r ../python_models/optional $DATADIR && \ + mkdir $DATADIR/optional/1 && \ + mv $DATADIR/optional/model.py $DATADIR/optional/1 && \ + sed -i 's/max_batch_size: 0/max_batch_size: 2/g' $DATADIR/optional/config.pbtxt + +# Copy decoupled model +git clone --depth=1 https://github.com/triton-inference-server/python_backend +mkdir -p $DATADIR/repeat_int32/1 +cp python_backend/examples/decoupled/repeat_config.pbtxt $DATADIR/repeat_int32/config.pbtxt +cp python_backend/examples/decoupled/repeat_model.py $DATADIR/repeat_int32/1/model.py + # Generating test data mkdir -p $TESTDATADIR for INPUT in INPUT0 INPUT1; do @@ -136,7 +153,7 @@ fi SERVER_ERROR_STRING="The previous sequence did not end before this sequence start" set +e -$PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_object_object -p2000 >$CLIENT_LOG 2>&1 +$PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_object_object -p2000 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed: Expected an error when using dynamic shapes in string inputs\n***" @@ -148,21 +165,9 @@ if [ $(cat $CLIENT_LOG | grep "input INPUT0 contains dynamic shape, provide sha RET=1 fi -$PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_object_object -p2000 --shape INPUT0 >$CLIENT_LOG 2>&1 -if [ $? -eq 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Failed: Expected an error when using dynamic shapes with incorrect arguments\n***" - RET=1 -fi -if [ $(cat $CLIENT_LOG | grep "failed to parse input shape. There must be a colon after input name." 
| wc -l) -eq 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Failed: \n***" - RET=1 -fi - # Testing with ensemble and sequential model variants $PERF_ANALYZER -v -i grpc -m simple_savedmodel_sequence_object -p 2000 -t5 --streaming \ ---input-data=$SEQ_JSONDATAFILE --input-data=$SEQ_JSONDATAFILE >$CLIENT_LOG 2>&1 +--input-data=$SEQ_JSONDATAFILE --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -180,7 +185,7 @@ if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then fi $PERF_ANALYZER -v -i grpc -m simple_savedmodel_sequence_object -p 1000 --request-rate-range 100:200:50 --streaming \ ---input-data=$SEQ_JSONDATAFILE >$CLIENT_LOG 2>&1 +--input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -205,7 +210,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 -t 1 -p2000 -b 1 \ - --shared-memory=$SHARED_MEMORY_TYPE >$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -218,7 +223,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 -t 1 -p2000 -b 1 -a \ - --shared-memory=$SHARED_MEMORY_TYPE>$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -238,7 +243,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m inception_v1_graphdef -t 1 -p2000 -b 1 \ - --shared-memory=$SHARED_MEMORY_TYPE >$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -251,7 +256,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m inception_v1_graphdef -t 1 -p2000 -b 1 -a \ - --shared-memory=$SHARED_MEMORY_TYPE>$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -269,7 +274,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m inception_v1_graphdef -t 2 -p2000 -b 64 \ - --shared-memory=$SHARED_MEMORY_TYPE >$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -282,7 +287,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m inception_v1_graphdef -t 2 -p2000 -b 64 \ - --shared-memory=$SHARED_MEMORY_TYPE -a >$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -a -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -300,7 +305,7 @@ for PROTOCOL in grpc http; do for MODEL in graphdef_nobatch_int32_int32_int32 graphdef_int32_int32_int32; do # Valid batch size set +e - $PERF_ANALYZER -v -i $PROTOCOL -m $MODEL -t 1 -p2000 -b 1 >$CLIENT_LOG 2>&1 + $PERF_ANALYZER -v -i $PROTOCOL -m $MODEL -t 1 -p2000 -b 1 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? 
-ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -311,7 +316,7 @@ for PROTOCOL in grpc http; do # Invalid batch sizes for STATIC_BATCH in 0 10; do set +e - $PERF_ANALYZER -v -i $PROTOCOL -m $MODEL -t 1 -p2000 -b $STATIC_BATCH >$CLIENT_LOG 2>&1 + $PERF_ANALYZER -v -i $PROTOCOL -m $MODEL -t 1 -p2000 -b $STATIC_BATCH -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -323,7 +328,7 @@ for PROTOCOL in grpc http; do # Testing with the new arguments set +e - $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 >$CLIENT_LOG 2>&1 + $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -335,7 +340,7 @@ for PROTOCOL in grpc http; do RET=1 fi - $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 --concurrency-range 1:5:2 >$CLIENT_LOG 2>&1 + $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 --concurrency-range 1:5:2 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -348,7 +353,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 --concurrency-range 1:5:2 \ - --input-data=${INT_JSONDATAFILE} >$CLIENT_LOG 2>&1 + --input-data=${INT_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -361,7 +366,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 --request-rate-range 1000:2000:500 \ - -p1000 -b 1 -a>$CLIENT_LOG 2>&1 + -p1000 -b 1 -a -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -374,7 +379,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 --request-rate-range 1000:2000:500 \ - --input-data=${INT_JSONDATAFILE} -p1000 -b 1 -a>$CLIENT_LOG 2>&1 + --input-data=${INT_JSONDATAFILE} -p1000 -b 1 -a -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -388,7 +393,7 @@ for PROTOCOL in grpc http; do # Binary search for request rate mode $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 --request-rate-range 1000:2000:100 -p1000 -b 1 \ - -a --binary-search --request-distribution "poisson" -l 10 >$CLIENT_LOG 2>&1 + -a --binary-search --request-distribution "poisson" -l 10 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -400,11 +405,11 @@ for PROTOCOL in grpc http; do RET=1 fi set -e - + # Binary search for concurrency range mode and make sure it doesn't hang $PERF_ANALYZER -v -a --request-distribution "poisson" --shared-memory none \ --percentile 99 --binary-search --concurrency-range 1:8:2 -l 5 \ - -m graphdef_int32_int32_int32 -b 1 >$CLIENT_LOG 2>&1 & + -m graphdef_int32_int32_int32 -b 1 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 & PA_PID=$! 
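 # $! holds the PID of the backgrounded perf_analyzer run; the checks that
 # follow use it to fail fast if the process never started and, per the
 # comment above, to make sure the binary-search run does not hang.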
if [ "$PA_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $PERF_ANALYZER\n***" @@ -435,7 +440,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_object_object --string-data=1 -p2000 \ - --shared-memory=$SHARED_MEMORY_TYPE>$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -453,7 +458,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_object_object --input-data=$TESTDATADIR -p2000 \ - --shared-memory=$SHARED_MEMORY_TYPE>$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -470,7 +475,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_object_object --input-data=$STRING_JSONDATAFILE \ - --input-data=$STRING_JSONDATAFILE -p2000 --shared-memory=$SHARED_MEMORY_TYPE>$CLIENT_LOG 2>&1 + --input-data=$STRING_JSONDATAFILE -p2000 --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -488,7 +493,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_int32_int32 --input-data=$TESTDATADIR \ - --shape INPUT0:2,8 --shape INPUT1:2,8 -p2000 --shared-memory=$SHARED_MEMORY_TYPE \ + --shape INPUT0:2,8 --shape INPUT1:2,8 -p2000 --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} \ >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG @@ -506,7 +511,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_int32_int32 --input-data=$STRING_WITHSHAPE_JSONDATAFILE \ - --shape INPUT0:2,8 --shape INPUT1:2,8 -p2000 --shared-memory=$SHARED_MEMORY_TYPE \ + --shape INPUT0:2,8 --shape INPUT1:2,8 -p2000 --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} \ >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG @@ -523,7 +528,7 @@ for PROTOCOL in grpc http; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_float32 --shape INPUT0:2,8,2 \ - --shape INPUT1:2,8,2 -p2000 >$CLIENT_LOG 2>&1 + --shape INPUT1:2,8,2 -p2000 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -540,13 +545,13 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_float32 --shape INPUT0:2,8,2 --shape INPUT1:2,8,2 -p2000 -b 4 \ - --shared-memory=$SHARED_MEMORY_TYPE --input-data=$INT_DIFFSHAPE_JSONDATAFILE >$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE --input-data=$INT_DIFFSHAPE_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? 
-eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi - if [ $(cat $CLIENT_LOG | grep "can not batch tensors with different shapes together" | wc -l) -eq 0 ]; then + if [ $(cat $CLIENT_LOG | grep -P "The supplied shape .+ is incompatible with the model's input shape" | wc -l) -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 @@ -558,7 +563,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m plan_zero_1_float32 --input-data=$SHAPETENSORADTAFILE \ - --shape DUMMY_INPUT0:4,4 -p2000 --shared-memory=$SHARED_MEMORY_TYPE -b 8 \ + --shape DUMMY_INPUT0:4,4 -p2000 --shared-memory=$SHARED_MEMORY_TYPE -b 8 -s ${STABILITY_THRESHOLD} \ >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG @@ -575,7 +580,7 @@ for PROTOCOL in grpc http; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m simple_savedmodel_sequence_object -p 2000 -t5 --sync \ - --input-data=$SEQ_JSONDATAFILE >$CLIENT_LOG 2>&1 + --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -588,7 +593,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m simple_savedmodel_sequence_object -p 2000 -t5 --sync \ - --input-data=$SEQ_JSONDATAFILE >$CLIENT_LOG 2>&1 + --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -601,7 +606,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m simple_savedmodel_sequence_object -p 1000 --request-rate-range 100:200:50 --sync \ - --input-data=$SEQ_JSONDATAFILE >$CLIENT_LOG 2>&1 + --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -621,13 +626,13 @@ for PROTOCOL in grpc http; do set +e # FIXME: Enable HTTP when the server is able to correctly return the complex error messages. $PERF_ANALYZER -v -i grpc -m graphdef_sequence_float32 --shape INPUT:2 --input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE \ - --input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE -p2000 --shared-memory=$SHARED_MEMORY_TYPE >$CLIENT_LOG 2>&1 + --input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE -p2000 --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi - if [ $(cat $CLIENT_LOG | grep "Inputs to operation Select of type Select must have the same size and shape." | wc -l) -eq 0 ]; then + if [ $(cat $CLIENT_LOG | grep -P "The supplied shape .+ is incompatible with the model's input shape" | wc -l) -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 @@ -635,11 +640,32 @@ for PROTOCOL in grpc http; do set -e done + # Testing that trace logging works + set +e + TRACE_FILE="trace.json" + rm ${TRACE_FILE}* + $PERF_ANALYZER -v -i $PROTOCOL -m simple_savedmodel_sequence_object -p 2000 -t5 --sync --trace-file $TRACE_FILE \ + --trace-level TIMESTAMPS --trace-rate 1000 --trace-count 100 --log-frequency 10 \ + --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + fi + if ! compgen -G "$TRACE_FILE*" > /dev/null; then + echo -e "\n***\n*** Test Failed. 
$TRACE_FILE failed to generate.\n***" + RET=1 + elif [ $(cat ${TRACE_FILE}* | grep "REQUEST_START" | wc -l) -eq 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed. Did not find `REQUEST_START` in $TRACE_FILE \n***" + RET=1 + fi + set -e done # Test with output validation set +e -$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${NON_ALIGNED_OUTPUT_JSONDATAFILE} >$CLIENT_LOG 2>&1 +$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${NON_ALIGNED_OUTPUT_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -651,19 +677,19 @@ if [ $(cat $CLIENT_LOG | grep "The 'validation_data' field doesn't align with ' RET=1 fi -$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${WRONG_OUTPUT_JSONDATAFILE} >$CLIENT_LOG 2>&1 +$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${WRONG_OUTPUT_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi -if [ $(cat $CLIENT_LOG | grep "Output size doesn't match expected size" | wc -l) -eq 0 ]; then +if [ $(cat $CLIENT_LOG | grep "mismatch in the data provided" | wc -l) -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi -$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${WRONG_OUTPUT_2_JSONDATAFILE} >$CLIENT_LOG 2>&1 +$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${WRONG_OUTPUT_2_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -676,7 +702,7 @@ if [ $(cat $CLIENT_LOG | grep "Output doesn't match expected output" | wc -l) - fi -$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${OUTPUT_JSONDATAFILE} >$CLIENT_LOG 2>&1 +$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${OUTPUT_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -689,7 +715,7 @@ if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then fi $PERF_ANALYZER -v -m simple_savedmodel_sequence_object -i grpc --streaming \ ---input-data=${SEQ_WRONG_OUTPUT_JSONDATAFILE} >$CLIENT_LOG 2>&1 +--input-data=${SEQ_WRONG_OUTPUT_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -702,7 +728,7 @@ if [ $(cat $CLIENT_LOG | grep "Output doesn't match expected output" | wc -l) - fi $PERF_ANALYZER -v -m simple_savedmodel_sequence_object -i grpc --streaming \ ---input-data=${SEQ_OUTPUT_JSONDATAFILE} >$CLIENT_LOG 2>&1 +--input-data=${SEQ_OUTPUT_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -722,7 +748,7 @@ for i in {1..9}; do done set +e $PERF_ANALYZER -v -m simple_savedmodel_sequence_object -p 10000 --concurrency-range 1500:2000:250 -i grpc --streaming \ -${INPUT_DATA_OPTION} >$CLIENT_LOG 2>&1 +${INPUT_DATA_OPTION} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? 
-ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -740,7 +766,7 @@ set +e # Send incorrect shape and make sure that perf_analyzer doesn't hang $PERF_ANALYZER -v -m graphdef_object_int32_int32 --measurement-mode "count_windows" \ - --shape INPUT0:1,8,100 --shape INPUT1:2,8 --string-data=1 >$CLIENT_LOG 2>&1 + --shape INPUT0:1,8,100 --shape INPUT1:2,8 --string-data=1 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -753,7 +779,7 @@ if [ $(cat $CLIENT_LOG | grep "unexpected shape for input 'INPUT0' for model" | fi $PERF_ANALYZER -v -m graphdef_object_int32_int32 --measurement-mode "count_windows" \ - --shape INPUT0:2,8 --shape INPUT1:2,8 --string-data=1 >$CLIENT_LOG 2>&1 + --shape INPUT0:2,8 --shape INPUT1:2,8 --string-data=1 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -766,6 +792,117 @@ if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then fi set -e +# Test with optional inputs missing but still valid +set +e +$PERF_ANALYZER -v -m optional --measurement-mode "count_windows" \ + --input-data=${INT_OPTIONAL_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +set -e + +# Test with optional inputs missing and invalid +set +e +OPTIONAL_INPUT_ERROR_STRING="For batch sizes larger than 1, the same set of +inputs must be specified for each batch. You cannot use different set of +optional inputs for each individual batch." +$PERF_ANALYZER -v -m optional -b 2 --measurement-mode "count_windows" \ + --input-data=${INT_OPTIONAL_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 +if [ $? -eq 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "${OPTIONAL_INPUT_ERROR_STRING}" | wc -l) -eq 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +set -e + + +# Test Custom request rate option +CUSTOM_SCHEDULE_FILE=$TESTDATADIR/custom.schedule +echo '30000' >> $CUSTOM_SCHEDULE_FILE +echo '10000' >> $CUSTOM_SCHEDULE_FILE +echo '40000' >> $CUSTOM_SCHEDULE_FILE +echo '20000' >> $CUSTOM_SCHEDULE_FILE +echo '25000' >> $CUSTOM_SCHEDULE_FILE + +set +e +$PERF_ANALYZER -v -i grpc -m graphdef_int32_int32_int32 --request-intervals $CUSTOM_SCHEDULE_FILE >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "Request Rate: 40" | wc -l) -eq 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed: \n***" + RET=1 +fi +set -e + +# Test --serial-sequences mode +set +e +$PERF_ANALYZER -v -i $PROTOCOL -m simple_savedmodel_sequence_object -p 1000 --request-rate-range 100:200:50 --serial-sequences \ + --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi + +$PERF_ANALYZER -v -i $PROTOCOL -m simple_savedmodel_sequence_object -p 1000 --request-intervals $CUSTOM_SCHEDULE_FILE --serial-sequences \ + --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 +if [ $? 
-ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +set -e + +## Test decoupled model support +$PERF_ANALYZER -v -m repeat_int32 --input-data=$REPEAT_INT32_JSONDATAFILE \ + --profile-export-file profile_export.json -i grpc --async --streaming -s \ + ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +python3 -c "import json ; \ + requests = json.load(open('profile_export.json'))['experiments'][0]['requests'] ; \ + assert any(len(r['response_timestamps']) > 1 for r in requests)" +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +check-jsonschema --schemafile perf_analyzer_profile_export_schema.json profile_export.json +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi + ## Test perf_analyzer with MPI / multiple models is_synchronized() { @@ -851,10 +988,10 @@ set -e ## Test perf_analyzer without MPI library (`libmpi.so`) available -rm -rf /opt/hpcx +rm -rf /opt/hpcx/ompi/lib/libmpi* set +e -$PERF_ANALYZER -v -m graphdef_int32_int32_int32 +$PERF_ANALYZER -v -m graphdef_int32_int32_int32 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -907,6 +1044,7 @@ $PERF_ANALYZER -v -i grpc -m graphdef_int32_int32_int32 \ --ssl-grpc-root-certifications-file=ca.crt \ --ssl-grpc-private-key-file=client.key \ --ssl-grpc-certificate-chain-file=client.crt \ + -s ${STABILITY_THRESHOLD} \ > ${CLIENT_LOG}.grpc_success 2>&1 if [ $? -ne 0 ]; then cat ${CLIENT_LOG}.grpc_success @@ -919,6 +1057,7 @@ $PERF_ANALYZER -v -i grpc -m graphdef_int32_int32_int32 \ --ssl-grpc-root-certifications-file=ca.crt \ --ssl-grpc-private-key-file=client.key \ --ssl-grpc-certificate-chain-file=client2.crt \ + -s ${STABILITY_THRESHOLD} \ > ${CLIENT_LOG}.grpc_failure 2>&1 if [ $? -eq 0 ]; then cat ${CLIENT_LOG}.grpc_failure @@ -962,6 +1101,7 @@ $PERF_ANALYZER -v -u https://localhost:443 -i http -m graphdef_int32_int32_int32 --ssl-https-client-certificate-type PEM \ --ssl-https-private-key-file client.key \ --ssl-https-private-key-type PEM \ + -s ${STABILITY_THRESHOLD} \ > ${CLIENT_LOG}.https_success 2>&1 if [ $? -ne 0 ]; then cat ${CLIENT_LOG}.https_success @@ -971,7 +1111,8 @@ fi # Test that HTTP protocol with SSL works correctly without certificates $PERF_ANALYZER -v -u https://localhost:443 -i http -m graphdef_int32_int32_int32 \ --ssl-https-verify-peer 0 \ - --ssl-https-verify-host 0 + --ssl-https-verify-host 0 \ + -s ${STABILITY_THRESHOLD} \ > ${CLIENT_LOG}.https_success 2>&1 if [ $? -ne 0 ]; then cat ${CLIENT_LOG}.https_success @@ -987,6 +1128,7 @@ $PERF_ANALYZER -v -u https://localhost:443 -i http -m graphdef_int32_int32_int32 --ssl-https-client-certificate-type PEM \ --ssl-https-private-key-file client2.key \ --ssl-https-private-key-type PEM \ + -s ${STABILITY_THRESHOLD} \ > ${CLIENT_LOG}.https_failure 2>&1 if [ $? -eq 0 ]; then cat ${CLIENT_LOG}.https_failure diff --git a/qa/L0_perf_analyzer_capi/test.sh b/qa/L0_perf_analyzer_capi/test.sh index f447fe5d3d..f9fa3c078e 100755 --- a/qa/L0_perf_analyzer_capi/test.sh +++ b/qa/L0_perf_analyzer_capi/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -55,7 +55,8 @@ SEQ_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/seq_data.json SHAPETENSORADTAFILE=`pwd`/../common/perf_analyzer_input_data_json/shape_tensor_data.json ERROR_STRING="error | Request count: 0 | : 0 infer/sec" -NON_SUPPORTED_ERROR_STRING="supported by C API" + +STABILITY_THRESHOLD="15" source ../common/util.sh @@ -81,6 +82,11 @@ cp -r /data/inferenceserver/${REPO_VERSION}/qa_ensemble_model_repository/qa_sequ # Copying variable sequence model cp -r /data/inferenceserver/${REPO_VERSION}/qa_variable_sequence_model_repository/graphdef_sequence_float32 $DATADIR +# Copying bls model with undefined variable +mkdir -p $DATADIR/bls_undefined/1 && \ + cp ../python_models/bls_undefined/model.py $DATADIR/bls_undefined/1/. && \ + cp ../python_models/bls_undefined/config.pbtxt $DATADIR/bls_undefined/. + # Generating test data mkdir -p $TESTDATADIR for INPUT in INPUT0 INPUT1; do @@ -106,7 +112,7 @@ set -e $PERF_ANALYZER -v -m graphdef_int32_int32_int32 \ --service-kind=triton_c_api \ --model-repository=$DATADIR --triton-server-directory=$SERVER_LIBRARY_PATH \ ->$CLIENT_LOG 2>&1 +-s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -120,7 +126,8 @@ fi $PERF_ANALYZER -v -m graphdef_int32_int32_int32 -t 1 -p2000 -b 1 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -135,7 +142,8 @@ fi #Testing with string input $PERF_ANALYZER -v -m graphdef_object_object_object --string-data=1 -p2000 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -151,7 +159,8 @@ fi $PERF_ANALYZER -v -m graphdef_object_int32_int32 --input-data=$TESTDATADIR \ --shape INPUT0:2,8 --shape INPUT1:2,8 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -162,7 +171,7 @@ $PERF_ANALYZER -v -m graphdef_object_int32_int32 \ --input-data=$STRING_WITHSHAPE_JSONDATAFILE \ --shape INPUT0:2,8 --shape INPUT1:2,8 -p2000 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG @@ -178,7 +187,8 @@ fi $PERF_ANALYZER -v -m graphdef_int32_int32_float32 --shape INPUT0:2,8,2 \ --shape INPUT1:2,8,2 -p2000 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 if [ $? 
-ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -194,7 +204,8 @@ fi $PERF_ANALYZER -v -m plan_zero_1_float32 --input-data=$SHAPETENSORADTAFILE \ --shape DUMMY_INPUT0:4,4 -p2000 -b 8 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -206,73 +217,94 @@ if [ $(cat $CLIENT_LOG | grep ": 0 infer/sec\|: 0 usec" | wc -l) -ne 0 ]; then RET=1 fi -# TODO: Re-enable after sequence model support if fixed for CAPI -# $PERF_ANALYZER -v -m simple_savedmodel_sequence_object -p 2000 -t5 --sync \ -# --input-data=$SEQ_JSONDATAFILE \ -# --service-kind=triton_c_api --model-repository=$DATADIR \ -# --triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 -# if [ $? -ne 0 ]; then -# cat $CLIENT_LOG -# echo -e "\n***\n*** Test Failed\n***" -# RET=1 -# fi -# if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then -# cat $CLIENT_LOG -# echo -e "\n***\n*** Test Failed\n***" -# RET=1 -# fi -# -# TODO: Re-enable after variable model support if fixed for CAPI -# $PERF_ANALYZER -v -m graphdef_sequence_float32 --shape INPUT:2 \ -# --input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE \ -# --input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE -p2000 \ -# --service-kind=triton_c_api --model-repository=$DATADIR \ -# --triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 -# if [ $? -eq 0 ]; then -# cat $CLIENT_LOG -# echo -e "\n***\n*** Test Failed\n***" -# RET=1 -# fi -# if [ $(cat $CLIENT_LOG | grep "Inputs to operation Select of type Select must have the same size and shape." | wc -l) -eq 0 ]; then -# cat $CLIENT_LOG -# echo -e "\n***\n*** Test Failed\n***" -# RET=1 -# fi - -#Testing that async does NOT work +$PERF_ANALYZER -v -m simple_savedmodel_sequence_object -p 2000 -t5 --sync \ +--input-data=$SEQ_JSONDATAFILE \ +--service-kind=triton_c_api --model-repository=$DATADIR \ +--triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi + +set +e +$PERF_ANALYZER -v -m graphdef_sequence_float32 --shape INPUT:2 \ +--input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE \ +--input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE -p2000 \ +--service-kind=triton_c_api --model-repository=$DATADIR \ +--triton-server-directory=$SERVER_LIBRARY_PATH --sync >$CLIENT_LOG 2>&1 +if [ $? -eq 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep -P "The supplied shape .+ is incompatible with the model's input shape" | wc -l) -eq 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +set -e + +# Negative test for the async mode. 
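+# Async mode (-a) is expected to be rejected when --service-kind=triton_c_api
+# is used; the grep below requires exactly one "not supported by triton_c_api
+# service" message in the client log.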
set +e $PERF_ANALYZER -v -m graphdef_int32_int32_int32 -t 1 -p2000 -b 1 -a \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 -if [ $(cat $CLIENT_LOG | grep "${NON_SUPPORTED_ERROR_STRING}" | wc -l) -ne 1 ]; then +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 +if [ $(cat $CLIENT_LOG | grep "not supported by triton_c_api service" | wc -l) -ne 1 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi set -e -#Testing that shared memory does NOT work for SHARED_MEMORY_TYPE in system cuda; do - set +e $PERF_ANALYZER -v -m graphdef_int32_int32_int32 -t 1 -p2000 -b 1 \ --shared-memory=$SHARED_MEMORY_TYPE \ --service-kind=triton_c_api --model-repository=$DATADIR \ --triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 - if [ $(cat $CLIENT_LOG | grep "${NON_SUPPORTED_ERROR_STRING}" | wc -l) -ne 1 ]; then + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + fi + if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi - set -e done -# Testing --request-rate-range does NOT work -set +e $PERF_ANALYZER -v -m graphdef_int32_int32_int32 --request-rate-range 1000:2000:500 -p1000 -b 1 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 -if [ $(cat $CLIENT_LOG | grep "${NON_SUPPORTED_ERROR_STRING}" | wc -l) -ne 1 ]; then +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +set -e + +set +e +# Testing erroneous configuration +# This model is expected to fail +$PERF_ANALYZER -v -m bls_undefined --shape INPUT0:1048576 -t 64\ +--service-kind=triton_c_api \ +--model-repository=$DATADIR --triton-server-directory=$SERVER_LIBRARY_PATH \ +-s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 +if [ $? -ne 99 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 diff --git a/qa/L0_perf_analyzer_doc_links/mkdocs.yml b/qa/L0_perf_analyzer_doc_links/mkdocs.yml new file mode 100644 index 0000000000..41a4bfe485 --- /dev/null +++ b/qa/L0_perf_analyzer_doc_links/mkdocs.yml @@ -0,0 +1,36 @@ +# Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +site_name: CI Test +use_directory_urls: False +docs_dir: "./docs" +plugins: + - htmlproofer + - search + +markdown_extensions: + - toc: + permalink: True diff --git a/qa/L0_perf_analyzer_doc_links/test.sh b/qa/L0_perf_analyzer_doc_links/test.sh new file mode 100755 index 0000000000..c0c195cd18 --- /dev/null +++ b/qa/L0_perf_analyzer_doc_links/test.sh @@ -0,0 +1,73 @@ +#!/bin/bash +# Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +LOG="`pwd`/doc_links.log" +CONFIG="`pwd`/mkdocs.yml" +RET=0 + +# Download necessary packages +python3 -m pip install mkdocs +python3 -m pip install mkdocs-htmlproofer-plugin==0.10.3 + +#Download perf_analyzer docs +TRITON_CLIENT_REPO_TAG="${TRITON_CLIENT_REPO_TAG:=main}" +git clone -b ${TRITON_CLIENT_REPO_TAG} https://github.com/triton-inference-server/client.git +cp `pwd`/client/src/c++/perf_analyzer/README.md . +cp -rf `pwd`/client/src/c++/perf_analyzer/docs . + +# Need to remove all links that start with -- or -. Mkdocs converts all -- to - for anchor links. +# This breaks all links to cli commands throughout the docs. This will iterate over all +# files in the docs directory and remove -- and - at the start of options, which allows the +# tool to check links for correctness. +for file in `pwd`/docs/*.md +do + echo $file + sed -i 's/`-*/`/g' $file + sed -i 's/#-*/#/g' $file +done + +exec mkdocs serve -f $CONFIG > $LOG & +PID=$! +sleep 20 + +until [[ (-z `pgrep mkdocs`) ]]; do + kill -2 $PID + sleep 2 +done + +if [[ ! 
-z `grep "invalid url" $LOG` ]]; then + cat $LOG + RET=1 +fi + + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test PASSED\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi +exit $RET diff --git a/qa/L0_perf_analyzer_ground_truth/test.sh b/qa/L0_perf_analyzer_ground_truth/test.sh new file mode 100755 index 0000000000..d5d78e63f4 --- /dev/null +++ b/qa/L0_perf_analyzer_ground_truth/test.sh @@ -0,0 +1,175 @@ +#!/bin/bash +# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "${REPO_VERSION}" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +source ../common/util.sh + +# Setup client/perf_analyzer +CLIENT_LOG="./perf_analyzer.log" +PERF_ANALYZER=../clients/perf_analyzer + +function check_perf_analyzer_error { + ERROR_STRING="error | Request count: 0 | : 0 infer/sec" + CLIENT_RET="$1" + if [ ${CLIENT_RET} -ne 0 ]; then + cat ${CLIENT_LOG} + echo -e "\n***\n*** Test Failed\n***" + RET=1 + fi + if [ $(cat ${CLIENT_LOG} | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then + cat ${CLIENT_LOG} + echo -e "\n***\n*** Test Failed\n***" + RET=1 + fi +} + +# Checks that the model infer/sec performance is equal to an expected value +# +/- some tolerance. 
+# $1: csv result file from PA run +# $2: expected infer/sec value +# $3: tolerance for expected value equality +function check_performance { + # get the boundary values based on the tolerance percentage + MIN=$(python3 -c "print(${2} * (1 - ${3}))") + MAX=$(python3 -c "print(${2} * (1 + ${3}))") + + # delete all but the 2nd line in the resulting file + # then get the 2nd column value which is the infer/sec measurement + report_val=$(sed '2!d' $1 | awk -F ',' {'print $2'}) + + # check if within tolerance + ret=$(python3 -c "print(${report_val} >= ${MIN} and ${report_val} <= ${MAX})") + if [ "$ret" = "False" ]; then + echo -e "\n***\n*** Test Failed\n***" + RET=1 + fi +} + +# Iterate over the grpc results to ensure gRPC times are greater than 0 +# $1: client log file +# example line: Avg gRPC time: 42648 usec (marshal 6 usec + response wait 42640 usec + unmarshal 2 usec) +function check_grpc_time { + grep "gRPC" $1 | awk '{print $4}' | while read -r line; do + if [ $line -eq 0 ]; then + RET=1 + fi + done +} + +# Create input_data.json to communicate the requested model delay +# $1: desired model delay +function create_input_data { + echo "{\"data\":[{\"INPUT0\" : [${1}]}]}" > input_data.json +} + +# Setup server +export CUDA_VISIBLE_DEVICES=0 +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_ARGS="--model-repository=`pwd`/models" +SERVER_LOG="./inference_server.log" + +rm -f $SERVER_LOG $CLIENT_LOG +MODEL_DIR="./models" +rm -fr ${MODEL_DIR} && mkdir ${MODEL_DIR} +MODELS="ground_truth" + +for model in ${MODELS}; do + # Add version directory to each model if non-existent + mkdir -p "${MODEL_DIR}/${model}/1" + cp ../python_models/${model}/model.py ./models/${model}/1/model.py + cp ../python_models/${model}/config.pbtxt ./models/${model}/config.pbtxt +done + +# Run server +run_server +if [ "${SERVER_PID}" == "0" ]; then + echo -e "\n***\n*** Failed to start ${SERVER}\n***" + cat ${SERVER_LOG} + exit 1 +fi + +# Run perf_analyzer +set +e +RET=0 +PROTOCOLS="http grpc" +OUTPUT_FILE="results" +MODEL_DELAYS=(0.05 0.5) +TOLERANCE="0.05" + +for model_delay in ${MODEL_DELAYS[@]}; do + create_input_data ${model_delay} + EXPECTED_RESULT=$(python3 -c "print(1 / ${model_delay})") + for protocol in ${PROTOCOLS}; do + for model in ${MODELS}; do + echo "================================================================" + echo "[PERMUTATION] Protocol=${protocol} Model=${model}" + echo "================================================================" + + ${PERF_ANALYZER} -v -i ${protocol} --concurrency-range 2 --input-data input_data.json -m ${model} -f ${OUTPUT_FILE} | tee ${CLIENT_LOG} 2>&1 + check_perf_analyzer_error $? 
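+            # check_performance below compares the infer/sec column of ${OUTPUT_FILE}
+            # against EXPECTED_RESULT (1/${model_delay}) within the ${TOLERANCE} window.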
+ + check_performance ${OUTPUT_FILE} ${EXPECTED_RESULT} ${TOLERANCE} + + if [ "${protocol}" == "grpc" ]; then + check_grpc_time ${CLIENT_LOG} + fi + done; + done; +done; + + +set -e + +# Cleanup +kill $SERVER_PID +wait $SERVER_PID + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo "=== START SERVER LOG ===" + cat ${SERVER_LOG} + echo "=== END SERVER LOG ===" + echo "=== START CLIENT LOG ===" + cat ${CLIENT_LOG} + echo "=== END CLIENT LOG ===" + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit ${RET} diff --git a/qa/L0_perf_analyzer_report/test.sh b/qa/L0_perf_analyzer_report/test.sh index b820bd019e..7a04905842 100755 --- a/qa/L0_perf_analyzer_report/test.sh +++ b/qa/L0_perf_analyzer_report/test.sh @@ -125,14 +125,14 @@ done sed -i "s/${COMPOSING_MODEL}/${COMPOSING_MODEL_CACHE_ENABLED}/g" "${MODEL_DIR}/${ENSEMBLE_MODEL_CACHE_ENABLED}/config.pbtxt" sed -i "s/${COMPOSING_MODEL}/${COMPOSING_MODEL_CACHE_DISABLED}/g" "${MODEL_DIR}/${ENSEMBLE_MODEL_CACHE_DISABLED}/config.pbtxt" -## Append cache config to each model config -echo "response_cache { enable: True }" >> "${MODEL_DIR}/${ENSEMBLE_MODEL_CACHE_ENABLED}/config.pbtxt" -echo "response_cache { enable: False }" >> "${MODEL_DIR}/${ENSEMBLE_MODEL_CACHE_DISABLED}/config.pbtxt" -echo "response_cache { enable: True }" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_ENABLED}/config.pbtxt" -echo "response_cache { enable: False }" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_DISABLED}/config.pbtxt" +## Append cache config to each model config +echo -e "response_cache { enable: True }" >> "${MODEL_DIR}/${ENSEMBLE_MODEL_CACHE_ENABLED}/config.pbtxt" +echo -e "response_cache { enable: False }" >> "${MODEL_DIR}/${ENSEMBLE_MODEL_CACHE_DISABLED}/config.pbtxt" +echo -e "response_cache { enable: True }" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_ENABLED}/config.pbtxt" +echo -e "response_cache { enable: False }" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_DISABLED}/config.pbtxt" # Force CPU memory for composing models since cache doesn't currently support GPU memory -echo "instance_group [{ kind: KIND_CPU \n count: 1 }]" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_ENABLED}/config.pbtxt" -echo "instance_group [{ kind: KIND_CPU \n count: 1 }]" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_DISABLED}/config.pbtxt" +echo -e "instance_group [{ kind: KIND_CPU, count: 1 }]" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_ENABLED}/config.pbtxt" +echo -e "instance_group [{ kind: KIND_CPU, count: 1 }]" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_DISABLED}/config.pbtxt" # Run server run_server @@ -146,13 +146,14 @@ fi set +e RET=0 PROTOCOLS="http grpc" +STABILITY_THRESHOLD="15" for protocol in ${PROTOCOLS}; do for model in ${MODELS}; do echo "================================================================" echo "[PERMUTATION] Protocol=${protocol} Model=${model}" echo "================================================================" - ${PERF_ANALYZER} -v -i ${protocol} -m ${model} | tee ${CLIENT_LOG} 2>&1 + ${PERF_ANALYZER} -v -i ${protocol} -m ${model} -s ${STABILITY_THRESHOLD} | tee ${CLIENT_LOG} 2>&1 check_perf_analyzer_error $? # Check response cache outputs diff --git a/qa/L0_perf_deeprecommender/run_test.sh b/qa/L0_perf_deeprecommender/run_test.sh index ca5fa8e27c..2fb74eadfc 100755 --- a/qa/L0_perf_deeprecommender/run_test.sh +++ b/qa/L0_perf_deeprecommender/run_test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2019-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,12 +28,13 @@ STATIC_BATCH_SIZES=${STATIC_BATCH_SIZES:=1} DYNAMIC_BATCH_SIZES=${DYNAMIC_BATCH_SIZES:=1} INSTANCE_COUNTS=${INSTANCE_COUNTS:=1} +TF_VERSION=${TF_VERSION:=2} PERF_CLIENT=../clients/perf_client REPORTER=../common/reporter.py SERVER=/opt/tritonserver/bin/tritonserver -SERVER_ARGS="--model-repository=`pwd`/models" +SERVER_ARGS="--model-repository=`pwd`/models --backend-config=tensorflow,version=${TF_VERSION}" source ../common/util.sh # Select the single GPU that will be available to the inference @@ -69,7 +70,8 @@ for STATIC_BATCH in $STATIC_BATCH_SIZES; do echo "dynamic_batching { preferred_batch_size: [ ${DYNAMIC_BATCH} ] }" >> config.pbtxt) fi - SERVER_LOG="${NAME}.serverlog" + echo "Time before starting server: $(date)" + SERVER_LOG="${NAME}.server.log" run_server if (( $SERVER_PID == 0 )); then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -78,6 +80,7 @@ for STATIC_BATCH in $STATIC_BATCH_SIZES; do fi set +e + echo "Time before perf analyzer trials: $(date)" # Run the model once to warm up. Some frameworks do # optimization on the first requests. Must warmup similar @@ -85,14 +88,22 @@ for STATIC_BATCH in $STATIC_BATCH_SIZES; do $PERF_CLIENT -v -i ${PERF_CLIENT_PROTOCOL} -m $MODEL_NAME -p5000 \ -b${STATIC_BATCH} --concurrency-range ${CONCURRENCY} + set -o pipefail + PA_MAX_TRIALS=${PA_MAX_TRIALS:-"50"} $PERF_CLIENT -v -i ${PERF_CLIENT_PROTOCOL} -m $MODEL_NAME -p5000 \ -b${STATIC_BATCH} --concurrency-range ${CONCURRENCY} \ + --max-trials "${PA_MAX_TRIALS}" \ -f ${NAME}.csv 2>&1 | tee ${NAME}.log if (( $? != 0 )); then + echo -e "\n***\n*** FAILED Perf Analyzer measurement\n***" RET=1 fi + echo "Time after perf analyzer trials: $(date)" + set +o pipefail + curl localhost:8002/metrics -o ${NAME}.metrics >> ${NAME}.log 2>&1 if (( $? 
!= 0 )); then + echo -e "\n***\n*** FAILED to get metrics\n***" RET=1 fi diff --git a/qa/L0_perf_deeprecommender/test.sh b/qa/L0_perf_deeprecommender/test.sh index 2c528794af..3048e46cf5 100755 --- a/qa/L0_perf_deeprecommender/test.sh +++ b/qa/L0_perf_deeprecommender/test.sh @@ -43,7 +43,7 @@ TRTEXEC=/usr/src/tensorrt/bin/trtexec MODEL="deeprecommender" PROTOCOLS="grpc http" -rm -f *.log *.serverlog *.csv *.metrics *.tjson *.json +rm -f *.log *.csv *.metrics *.tjson *.json # # Test minimum latency @@ -58,6 +58,7 @@ rm -fr tensorrt_models && mkdir tensorrt_models (cd tensorrt_models/deeprecommender_plan && \ sed -i "s/^name:.*/name: \"deeprecommender_plan\"/" config.pbtxt && \ sed -i "s/tensorflow_graphdef/tensorrt_plan/" config.pbtxt && \ + sed -i "s/max_batch_size:.*/max_batch_size: ${STATIC_BATCH}/" config.pbtxt && \ sed -i "s/\[17736\]/\[17736,1,1\]/" config.pbtxt) $TRTEXEC --uff=$REPODIR/perf_model_store/deeprecommender_graphdef/deeprecommender_graphdef.uff \ @@ -117,6 +118,7 @@ rm -fr tensorrt_models && mkdir tensorrt_models (cd tensorrt_models/deeprecommender_plan && \ sed -i "s/^name:.*/name: \"deeprecommender_plan\"/" config.pbtxt && \ sed -i "s/tensorflow_graphdef/tensorrt_plan/" config.pbtxt && \ + sed -i "s/max_batch_size:.*/max_batch_size: ${STATIC_BATCH}/" config.pbtxt && \ sed -i "s/\[17736\]/\[17736,1,1\]/" config.pbtxt) $TRTEXEC --uff=$REPODIR/perf_model_store/deeprecommender_graphdef/deeprecommender_graphdef.uff \ diff --git a/qa/L0_perf_kaldi/create_data.sh b/qa/L0_perf_kaldi/create_data.sh old mode 100644 new mode 100755 index 68b32a4099..849b56d906 --- a/qa/L0_perf_kaldi/create_data.sh +++ b/qa/L0_perf_kaldi/create_data.sh @@ -25,7 +25,7 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -# Needs to be run in asr_kaldi main directory and must be copied to +# Needs to be run in asr_kaldi main directory and must be copied to # draco for benchmark test TRITON_VERSION="20.05" diff --git a/qa/L0_perf_kaldi/test.sh b/qa/L0_perf_kaldi/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_perf_nomodel/run_test.sh b/qa/L0_perf_nomodel/run_test.sh index 8e79f82550..b1e2702ecb 100755 --- a/qa/L0_perf_nomodel/run_test.sh +++ b/qa/L0_perf_nomodel/run_test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -48,15 +48,14 @@ ARCH=${ARCH:="x86_64"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends MODEL_REPO="${PWD}/models" -SERVER_ARGS="--model-repository=${MODEL_REPO} --backend-directory=${BACKEND_DIR}" +PERF_CLIENT=../clients/perf_client +TF_VERSION=${TF_VERSION:=2} +SERVER_ARGS="--model-repository=${MODEL_REPO} --backend-directory=${BACKEND_DIR} --backend-config=tensorflow,version=${TF_VERSION}" source ../common/util.sh # DATADIR is already set in environment variable for aarch64 -if [ "$ARCH" == "aarch64" ]; then - PERF_CLIENT=${TRITON_DIR}/clients/bin/perf_client -else - PERF_CLIENT=../clients/perf_client - DATADIR=/data/inferenceserver/${REPO_VERSION} +if [ "$ARCH" != "aarch64" ]; then + DATADIR="/data/inferenceserver/${REPO_VERSION}" fi # Select the single GPU that will be available to the inference server @@ -75,12 +74,16 @@ if [[ $BACKENDS == *"python"* ]]; then sed -i "s/^name:.*/name: \"python_zero_1_float32\"/" config.pbtxt) fi +if [[ $BACKENDS == *"custom"* ]]; then + mkdir -p "custom_models/custom_zero_1_float32/1" +fi + PERF_CLIENT_PERCENTILE_ARGS="" && (( ${PERF_CLIENT_PERCENTILE} != 0 )) && PERF_CLIENT_PERCENTILE_ARGS="--percentile=${PERF_CLIENT_PERCENTILE}" -PERF_CLIENT_EXTRA_ARGS="$PERF_CLIENT_PERCENTILE_ARGS --shared-memory \"${SHARED_MEMORY}\"" +PERF_CLIENT_EXTRA_ARGS="$PERF_CLIENT_PERCENTILE_ARGS --shared-memory ${SHARED_MEMORY}" -# Overload use of PERF_CLIENT_PROTOCOL for convenience with existing test and +# Overload use of PERF_CLIENT_PROTOCOL for convenience with existing test and # reporting structure, though "triton_c_api" is not strictly a "protocol". if [[ "${PERF_CLIENT_PROTOCOL}" == "triton_c_api" ]]; then # Server will be run in-process with C API @@ -166,9 +169,10 @@ for BACKEND in $BACKENDS; do echo "dynamic_batching { preferred_batch_size: [ ${DYNAMIC_BATCH} ] }" >> config.pbtxt) fi + echo "Time before starting server: $(date)" # Only start separate server if not using C API, since C API runs server in-process if [[ "${PERF_CLIENT_PROTOCOL}" != "triton_c_api" ]]; then - SERVER_LOG="${RESULTDIR}/${NAME}.serverlog" + SERVER_LOG="${RESULTDIR}/${NAME}.server.log" run_server if [ $SERVER_PID == 0 ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -177,19 +181,26 @@ for BACKEND in $BACKENDS; do fi fi + echo "Time before perf analyzer trials: $(date)" set +e + set -o pipefail + PA_MAX_TRIALS=${PA_MAX_TRIALS:-"50"} $PERF_CLIENT -v \ -p${PERF_CLIENT_STABILIZE_WINDOW} \ -s${PERF_CLIENT_STABILIZE_THRESHOLD} \ ${PERF_CLIENT_EXTRA_ARGS} \ -m ${MODEL_NAME} \ -b${STATIC_BATCH} -t${CONCURRENCY} \ + --max-trials "${PA_MAX_TRIALS}" \ --shape ${INPUT_NAME}:${SHAPE} \ ${SERVICE_ARGS} \ -f ${RESULTDIR}/${NAME}.csv 2>&1 | tee ${RESULTDIR}/${NAME}.log if [ $? -ne 0 ]; then + echo -e "\n***\n*** FAILED Perf Analyzer measurement\n***" RET=1 fi + echo "Time after perf analyzer trials: $(date)" + set +o pipefail set -e echo -e "[{\"s_benchmark_kind\":\"benchmark_perf\"," >> ${RESULTDIR}/${NAME}.tjson diff --git a/qa/L0_perf_nomodel/test.sh b/qa/L0_perf_nomodel/test.sh index 7f1051106a..6ff68303ed 100755 --- a/qa/L0_perf_nomodel/test.sh +++ b/qa/L0_perf_nomodel/test.sh @@ -38,7 +38,7 @@ if [ ! 
-z "$TEST_REPO_ARCH" ]; then REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} fi -rm -f *.log *.serverlog *.csv *.tjson *.json +rm -f *.log *.csv *.tjson *.json # Descriptive name for the current results UNDERTEST_NAME=${NVIDIA_TRITON_SERVER_VERSION} @@ -55,12 +55,12 @@ PERF_CLIENT_SLOWDOWN_THRESHOLD=5.0 # Length of window, in milliseconds, to use when stabilizing latency # and infer/sec results. -PERF_CLIENT_STABILIZE_WINDOW=5000 +PERF_CLIENT_STABILIZE_WINDOW=10000 # Threshold, as a percentage, to use when stabilizing latency and # infer/sec results. Values must vary by less than this percent over 3 # measurement windows to be considered value. -PERF_CLIENT_STABILIZE_THRESHOLD=5.0 +PERF_CLIENT_STABILIZE_THRESHOLD=15.0 RUNTEST=./run_test.sh diff --git a/qa/L0_perf_pyclients/simple_perf_client.py b/qa/L0_perf_pyclients/simple_perf_client.py old mode 100644 new mode 100755 index 00e1ea5427..fd02f94887 --- a/qa/L0_perf_pyclients/simple_perf_client.py +++ b/qa/L0_perf_pyclients/simple_perf_client.py @@ -26,13 +26,13 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse -import numpy as np +import sys import time +import numpy as np import tritonclient.grpc as grpcclient import tritonclient.http as httpclient -from tritonclient.utils import triton_to_np_dtype -from tritonclient.utils import InferenceServerException +from tritonclient.utils import InferenceServerException, triton_to_np_dtype FLAGS = None @@ -43,47 +43,59 @@ def parse_model_grpc(model_metadata, model_config): by this client. """ if len(model_metadata.inputs) != 1: - raise Exception("expecting 1 input, got {}".format( - len(model_metadata.inputs))) + raise Exception("expecting 1 input, got {}".format(len(model_metadata.inputs))) if len(model_metadata.outputs) != 1: - raise Exception("expecting 1 output, got {}".format( - len(model_metadata.outputs))) + raise Exception( + "expecting 1 output, got {}".format(len(model_metadata.outputs)) + ) if len(model_config.input) != 1: raise Exception( "expecting 1 input in model configuration, got {}".format( - len(model_config.input))) + len(model_config.input) + ) + ) input_metadata = model_metadata.inputs[0] output_metadata = model_metadata.outputs[0] - batch_dim = (model_config.max_batch_size > 0) + batch_dim = model_config.max_batch_size > 0 expected_dims = 1 + (1 if batch_dim else 0) if len(input_metadata.shape) != expected_dims: raise Exception( - "expecting input to have {} dimensions, model '{}' input has {}". - format(expected_dims, model_metadata.name, - len(input_metadata.shape))) + "expecting input to have {} dimensions, model '{}' input has {}".format( + expected_dims, model_metadata.name, len(input_metadata.shape) + ) + ) if len(output_metadata.shape) != expected_dims: raise Exception( - "expecting output to have {} dimensions, model '{}' output has {}". 
- format(expected_dims, model_metadata.name, - len(output_metadata.shape))) + "expecting output to have {} dimensions, model '{}' output has {}".format( + expected_dims, model_metadata.name, len(output_metadata.shape) + ) + ) if input_metadata.shape[-1] != -1: raise Exception( - "expecting input to have variable shape [-1], model '{}' input has {}" - .format(model_metadata.name, input_metadata.shape)) + "expecting input to have variable shape [-1], model '{}' input has {}".format( + model_metadata.name, input_metadata.shape + ) + ) if output_metadata.shape[-1] != -1: raise Exception( - "expecting output to have variable shape [-1], model '{}' output has {}" - .format(model_metadata.name, output_metadata.shape)) + "expecting output to have variable shape [-1], model '{}' output has {}".format( + model_metadata.name, output_metadata.shape + ) + ) - return (model_config.max_batch_size, input_metadata.name, - output_metadata.name, input_metadata.datatype) + return ( + model_config.max_batch_size, + input_metadata.name, + output_metadata.name, + input_metadata.datatype, + ) def parse_model_http(model_metadata, model_config): @@ -91,151 +103,176 @@ def parse_model_http(model_metadata, model_config): Check the configuration of a model to make sure it is supported by this client. """ - if len(model_metadata['inputs']) != 1: - raise Exception("expecting 1 input, got {}".format( - len(model_metadata['inputs']))) - if len(model_metadata['outputs']) != 1: - raise Exception("expecting 1 output, got {}".format( - len(model_metadata['outputs']))) - - if len(model_config['input']) != 1: + if len(model_metadata["inputs"]) != 1: + raise Exception( + "expecting 1 input, got {}".format(len(model_metadata["inputs"])) + ) + if len(model_metadata["outputs"]) != 1: + raise Exception( + "expecting 1 output, got {}".format(len(model_metadata["outputs"])) + ) + + if len(model_config["input"]) != 1: raise Exception( "expecting 1 input in model configuration, got {}".format( - len(model_config['input']))) + len(model_config["input"]) + ) + ) - input_metadata = model_metadata['inputs'][0] - output_metadata = model_metadata['outputs'][0] + input_metadata = model_metadata["inputs"][0] + output_metadata = model_metadata["outputs"][0] max_batch_size = 0 - if 'max_batch_size' in model_config: - max_batch_size = model_config['max_batch_size'] + if "max_batch_size" in model_config: + max_batch_size = model_config["max_batch_size"] - batch_dim = (max_batch_size > 0) + batch_dim = max_batch_size > 0 expected_dims = 1 + (1 if batch_dim else 0) - if len(input_metadata['shape']) != expected_dims: + if len(input_metadata["shape"]) != expected_dims: raise Exception( - "expecting input to have {} dimensions, model '{}' input has {}". - format(expected_dims, model_metadata.name, - len(input_metadata['shape']))) + "expecting input to have {} dimensions, model '{}' input has {}".format( + expected_dims, model_metadata.name, len(input_metadata["shape"]) + ) + ) - if len(output_metadata['shape']) != expected_dims: + if len(output_metadata["shape"]) != expected_dims: raise Exception( - "expecting output to have {} dimensions, model '{}' output has {}". 
- format(expected_dims, model_metadata.name, - len(output_metadata['shape']))) + "expecting output to have {} dimensions, model '{}' output has {}".format( + expected_dims, model_metadata.name, len(output_metadata["shape"]) + ) + ) - if input_metadata['shape'][-1] != -1: + if input_metadata["shape"][-1] != -1: raise Exception( - "expecting input to have variable shape [-1], model '{}' input has {}" - .format(model_metadata.name, input_metadata['shape'])) + "expecting input to have variable shape [-1], model '{}' input has {}".format( + model_metadata.name, input_metadata["shape"] + ) + ) - if output_metadata['shape'][-1] != -1: + if output_metadata["shape"][-1] != -1: raise Exception( - "expecting output to have variable shape [-1], model '{}' output has {}" - .format(model_metadata.name, output_metadata['shape'])) + "expecting output to have variable shape [-1], model '{}' output has {}".format( + model_metadata.name, output_metadata["shape"] + ) + ) - return (max_batch_size, input_metadata['name'], output_metadata['name'], - input_metadata['datatype']) + return ( + max_batch_size, + input_metadata["name"], + output_metadata["name"], + input_metadata["datatype"], + ) def requestGenerator(input_name, input_data, output_name, dtype, protocol): - # Set the input data inputs = [] if protocol.lower() == "grpc": - inputs.append(grpcclient.InferInput(input_name, input_data.shape, - dtype)) + inputs.append(grpcclient.InferInput(input_name, input_data.shape, dtype)) inputs[0].set_data_from_numpy(input_data) else: - inputs.append(httpclient.InferInput(input_name, input_data.shape, - dtype)) + inputs.append(httpclient.InferInput(input_name, input_data.shape, dtype)) inputs[0].set_data_from_numpy(input_data, binary_data=True) outputs = [] if protocol.lower() == "grpc": outputs.append(grpcclient.InferRequestedOutput(output_name)) else: - outputs.append( - httpclient.InferRequestedOutput(output_name, binary_data=True)) + outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True)) return inputs, outputs -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-v', - '--verbose', - action="store_true", - required=False, - default=False, - help='Enable verbose output') - parser.add_argument('-m', - '--model-name', - type=str, - required=True, - help='Name of model') parser.add_argument( - '-x', - '--model-version', + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-m", "--model-name", type=str, required=True, help="Name of model" + ) + parser.add_argument( + "-x", + "--model-version", type=str, required=False, default="", - help='Version of model. Default is to use latest version.') - parser.add_argument('-b', - '--batch-size', - type=int, - required=False, - default=1, - help='Batch size. Default is 1.') - parser.add_argument('-s', - '--shape', - type=int, - required=False, - default=1, - help='The shape of the tensor. Default is 1.') - parser.add_argument('-u', - '--url', - type=str, - required=False, - default='localhost:8000', - help='Inference server URL. Default is localhost:8000.') - parser.add_argument('-i', - '--protocol', - type=str, - required=False, - default='HTTP', - help='Protocol (HTTP/gRPC) used to communicate with ' + - 'the inference service. Default is HTTP.') - parser.add_argument('-c', - '--iteration_count', - type=int, - required=False, - default=1000, - help='The number of iterations. 
Default is 1000.') + help="Version of model. Default is to use latest version.", + ) parser.add_argument( - '-w', - '--warmup_count', + "-b", + "--batch-size", + type=int, + required=False, + default=1, + help="Batch size. Default is 1.", + ) + parser.add_argument( + "-s", + "--shape", + type=int, + required=False, + default=1, + help="The shape of the tensor. Default is 1.", + ) + parser.add_argument( + "-u", + "--url", + type=str, + required=False, + default="localhost:8000", + help="Inference server URL. Default is localhost:8000.", + ) + parser.add_argument( + "-i", + "--protocol", + type=str, + required=False, + default="HTTP", + help="Protocol (HTTP/gRPC) used to communicate with " + + "the inference service. Default is HTTP.", + ) + parser.add_argument( + "-c", + "--iteration_count", + type=int, + required=False, + default=1000, + help="The number of iterations. Default is 1000.", + ) + parser.add_argument( + "-w", + "--warmup_count", type=int, required=False, default=500, - help='The number of warm-up iterations. Default is 500.') + help="The number of warm-up iterations. Default is 500.", + ) parser.add_argument( - '--csv', + "--csv", type=str, required=False, default=None, - help='The name of the file to store the results in CSV format') + help="The name of the file to store the results in CSV format", + ) FLAGS = parser.parse_args() try: if FLAGS.protocol.lower() == "grpc": # Create gRPC client for communicating with the server triton_client = grpcclient.InferenceServerClient( - url=FLAGS.url, verbose=FLAGS.verbose) + url=FLAGS.url, verbose=FLAGS.verbose + ) else: triton_client = httpclient.InferenceServerClient( - url=FLAGS.url, verbose=FLAGS.verbose, concurrency=1) + url=FLAGS.url, verbose=FLAGS.verbose, concurrency=1 + ) except Exception as e: print("client creation failed: " + str(e)) sys.exit(1) @@ -244,7 +281,8 @@ def requestGenerator(input_name, input_data, output_name, dtype, protocol): # properties of the model that we need for preprocessing try: model_metadata = triton_client.get_model_metadata( - model_name=FLAGS.model_name, model_version=FLAGS.model_version) + model_name=FLAGS.model_name, model_version=FLAGS.model_version + ) except InferenceServerException as e: print("failed to retrieve the metadata: " + str(e)) sys.exit(1) @@ -253,36 +291,41 @@ def requestGenerator(input_name, input_data, output_name, dtype, protocol): # properties of the model that we need for preprocessing try: model_metadata = triton_client.get_model_metadata( - model_name=FLAGS.model_name, model_version=FLAGS.model_version) + model_name=FLAGS.model_name, model_version=FLAGS.model_version + ) except InferenceServerException as e: print("failed to retrieve the metadata: " + str(e)) sys.exit(1) try: model_config = triton_client.get_model_config( - model_name=FLAGS.model_name, model_version=FLAGS.model_version) + model_name=FLAGS.model_name, model_version=FLAGS.model_version + ) except InferenceServerException as e: print("failed to retrieve the config: " + str(e)) sys.exit(1) if FLAGS.protocol.lower() == "grpc": max_batch_size, input_name, output_name, dtype = parse_model_grpc( - model_metadata, model_config.config) + model_metadata, model_config.config + ) else: max_batch_size, input_name, output_name, dtype = parse_model_http( - model_metadata, model_config) + model_metadata, model_config + ) - input_data = np.zeros([FLAGS.batch_size, FLAGS.shape], - dtype=triton_to_np_dtype(dtype)) + input_data = np.zeros( + [FLAGS.batch_size, FLAGS.shape], dtype=triton_to_np_dtype(dtype) + ) # 
--------------------------- Warm-Up -------------------------------------------------------- for i in range(FLAGS.warmup_count): - inputs, outputs = requestGenerator(input_name, input_data, output_name, - dtype, FLAGS.protocol.lower()) - triton_client.infer(FLAGS.model_name, - inputs, - model_version=FLAGS.model_version, - outputs=outputs) + inputs, outputs = requestGenerator( + input_name, input_data, output_name, dtype, FLAGS.protocol.lower() + ) + triton_client.infer( + FLAGS.model_name, inputs, model_version=FLAGS.model_version, outputs=outputs + ) latencies = [] @@ -292,12 +335,12 @@ def requestGenerator(input_name, input_data, output_name, dtype, protocol): for i in range(FLAGS.iteration_count): t0 = time.time() - inputs, outputs = requestGenerator(input_name, input_data, output_name, - dtype, FLAGS.protocol.lower()) - triton_client.infer(FLAGS.model_name, - inputs, - model_version=FLAGS.model_version, - outputs=outputs) + inputs, outputs = requestGenerator( + input_name, input_data, output_name, dtype, FLAGS.protocol.lower() + ) + triton_client.infer( + FLAGS.model_name, inputs, model_version=FLAGS.model_version, outputs=outputs + ) latencies.append(time.time() - t0) end_time = time.time() @@ -320,12 +363,17 @@ def requestGenerator(input_name, input_data, output_name, dtype, protocol): # --------------------------- Write CSV -------------------------------------------------------- if FLAGS.csv != None: - file = open(FLAGS.csv, 'w') + file = open(FLAGS.csv, "w") file.write( "Concurrency,Inferences/Second,p50 latency,p90 latency,p95 latency,p99 latency\n" ) - file.write("1,{},{},{},{},{}".format(throughput, p50_latency * 1000, - p90_latency * 1000, - p95_latency * 1000, - p99_latency * 1000)) + file.write( + "1,{},{},{},{},{}".format( + throughput, + p50_latency * 1000, + p90_latency * 1000, + p95_latency * 1000, + p99_latency * 1000, + ) + ) file.close() diff --git a/qa/L0_perf_pyclients/test.sh b/qa/L0_perf_pyclients/test.sh index 57350a512c..9b7e405977 100755 --- a/qa/L0_perf_pyclients/test.sh +++ b/qa/L0_perf_pyclients/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -43,8 +43,10 @@ REPORTER=../common/reporter.py CLIENT_LOG="./simple_perf_client.log" SIMPLE_PERF_CLIENT=simple_perf_client.py +TF_VERSION=${TF_VERSION:=2} + SERVER=/opt/tritonserver/bin/tritonserver -SERVER_ARGS="--model-repository=`pwd`/custom_models" +SERVER_ARGS="--model-repository=`pwd`/custom_models --backend-config=tensorflow,version=${TF_VERSION}" source ../common/util.sh # Select the single GPU that will be available to the inference diff --git a/qa/L0_perf_resnet/run_test.sh b/qa/L0_perf_resnet/run_test.sh index 953aab71d3..579d00c0e5 100755 --- a/qa/L0_perf_resnet/run_test.sh +++ b/qa/L0_perf_resnet/run_test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
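The TF_VERSION plumbing added above all funnels into a single tritonserver flag. A minimal sketch of the resulting launch, assuming the default /opt/tritonserver layout and the models/ repository these tests build in the working directory:

    # TF_VERSION defaults to 2 in the scripts above; it selects the TensorFlow
    # version used by the TensorFlow backend via --backend-config.
    TF_VERSION=2
    /opt/tritonserver/bin/tritonserver \
        --model-repository=$(pwd)/models \
        --backend-directory=/opt/tritonserver/backends \
        --backend-config=tensorflow,version=${TF_VERSION}
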
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,6 +28,7 @@ STATIC_BATCH=${STATIC_BATCH:=1} INSTANCE_CNT=${INSTANCE_CNT:=1} BACKEND_CONFIG=${BACKEND_CONFIG:=""} +TF_VERSION=${TF_VERSION:=2} REPORTER=../common/reporter.py @@ -35,7 +36,7 @@ TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends MODEL_REPO="${PWD}/models" -SERVER_ARGS="--model-repository=${MODEL_REPO} --backend-directory=${BACKEND_DIR} ${BACKEND_CONFIG}" +SERVER_ARGS="--model-repository=${MODEL_REPO} --backend-directory=${BACKEND_DIR} ${BACKEND_CONFIG} --backend-config=tensorflow,version=${TF_VERSION}" source ../common/util.sh # Select the single GPU that will be available to the inference @@ -53,20 +54,16 @@ rm -fr models && mkdir -p models && \ sed -i "s/^max_batch_size:.*/max_batch_size: ${MAX_BATCH}/" config.pbtxt && \ echo "instance_group [ { count: ${INSTANCE_CNT} }]") -# Onnx and onnx-trt models are very slow on Jetson. MEASUREMENT_WINDOW=5000 +PERF_CLIENT=../clients/perf_client +# Onnx and onnx-trt models are very slow on Jetson. if [ "$ARCH" == "aarch64" ]; then - PERF_CLIENT=${TRITON_DIR}/clients/bin/perf_client if [ "$MODEL_FRAMEWORK" == "onnx" ] || [ "$MODEL_FRAMEWORK" == "onnx_trt" ]; then MEASUREMENT_WINDOW=20000 fi -else - PERF_CLIENT=../clients/perf_client fi -set +e - -# Overload use of PERF_CLIENT_PROTOCOL for convenience with existing test and +# Overload use of PERF_CLIENT_PROTOCOL for convenience with existing test and # reporting structure, though "triton_c_api" is not strictly a "protocol". if [[ "${PERF_CLIENT_PROTOCOL}" == "triton_c_api" ]]; then # Server will be run in-process with C API @@ -76,7 +73,7 @@ if [[ "${PERF_CLIENT_PROTOCOL}" == "triton_c_api" ]]; then else SERVICE_ARGS="-i ${PERF_CLIENT_PROTOCOL}" - SERVER_LOG="${NAME}.serverlog" + SERVER_LOG="${NAME}.server.log" run_server if (( $SERVER_PID == 0 )); then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -88,19 +85,27 @@ else # Must warmup similar to actual run so that all instances are ready # Note: Running extra PA for warmup doesn't make sense for C API since it # uses in-process tritonserver which will exit along with this PA process. + set +e $PERF_CLIENT -v -m $MODEL_NAME -p${MEASUREMENT_WINDOW} \ -b${STATIC_BATCH} --concurrency-range ${CONCURRENCY} \ ${SERVICE_ARGS} + set -e fi +set +e +set -o pipefail +PA_MAX_TRIALS=${PA_MAX_TRIALS:-"50"} # Measure perf client results and write them to a file for reporting $PERF_CLIENT -v -m $MODEL_NAME -p${MEASUREMENT_WINDOW} \ -b${STATIC_BATCH} --concurrency-range ${CONCURRENCY} \ + --max-trials "${PA_MAX_TRIALS}" \ ${SERVICE_ARGS} \ -f ${NAME}.csv 2>&1 | tee ${NAME}.log if (( $? != 0 )); then + echo -e "\n***\n*** FAILED Perf Analyzer measurement\n***" RET=1 fi +set +o pipefail set -e echo -e "[{\"s_benchmark_kind\":\"benchmark_perf\"," >> ${NAME}.tjson diff --git a/qa/L0_perf_resnet/test.sh b/qa/L0_perf_resnet/test.sh index afdc4911d2..93b946ec35 100755 --- a/qa/L0_perf_resnet/test.sh +++ b/qa/L0_perf_resnet/test.sh @@ -38,7 +38,7 @@ if [ ! 
-z "$TEST_REPO_ARCH" ]; then REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} fi -rm -f *.log *.serverlog *.csv *.tjson *.json +rm -f *.log *.csv *.tjson *.json PROTOCOLS="grpc http triton_c_api" @@ -110,7 +110,8 @@ done rm -fr tensorrt_models && mkdir tensorrt_models cp -r $REPODIR/caffe_models/trt_model_store/resnet50_plan tensorrt_models/${TRT_MODEL_NAME} && \ (cd tensorrt_models/${TRT_MODEL_NAME} && \ - sed -i "s/^name:.*/name: \"${TRT_MODEL_NAME}\"/" config.pbtxt) && \ + sed -i "s/^name:.*/name: \"${TRT_MODEL_NAME}\"/" config.pbtxt && \ + sed -i "s/max_batch_size:.*/max_batch_size: ${STATIC_BATCH}/" config.pbtxt) && \ mkdir -p tensorrt_models/${TRT_MODEL_NAME}/1 $CAFFE2PLAN -h -b ${STATIC_BATCH} \ -n prob -o tensorrt_models/${TRT_MODEL_NAME}/1/model.plan \ @@ -167,7 +168,8 @@ CONCURRENCY=4 rm -fr tensorrt_models && mkdir tensorrt_models cp -r $REPODIR/caffe_models/trt_model_store/resnet50_plan tensorrt_models/${TRT_MODEL_NAME} && \ (cd tensorrt_models/${TRT_MODEL_NAME} && \ - sed -i "s/^name:.*/name: \"${TRT_MODEL_NAME}\"/" config.pbtxt) && \ + sed -i "s/^name:.*/name: \"${TRT_MODEL_NAME}\"/" config.pbtxt && \ + sed -i "s/max_batch_size:.*/max_batch_size: ${STATIC_BATCH}/" config.pbtxt) && \ mkdir -p tensorrt_models/${TRT_MODEL_NAME}/1 $CAFFE2PLAN -h -b ${STATIC_BATCH} \ -n prob -o tensorrt_models/${TRT_MODEL_NAME}/1/model.plan \ diff --git a/qa/L0_perf_tfs/test.sh b/qa/L0_perf_tfs/test.sh deleted file mode 100755 index 9d44d241c1..0000000000 --- a/qa/L0_perf_tfs/test.sh +++ /dev/null @@ -1,153 +0,0 @@ -#!/bin/bash -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions -# are met: -# * Redistributions of source code must retain the above copyright -# notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and/or other materials provided with the distribution. -# * Neither the name of NVIDIA CORPORATION nor the names of its -# contributors may be used to endorse or promote products derived -# from this software without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY -# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR -# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR -# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY -# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE -# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - -if [ "$#" -ge 1 ]; then - REPO_VERSION=$1 -fi -if [ -z "$REPO_VERSION" ]; then - echo -e "Repository version must be specified" - echo -e "\n***\n*** Test Failed\n***" - exit 1 -fi -if [ ! 
-z "$TEST_REPO_ARCH" ]; then - REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} -fi - -apt update - # needed by perf_analyzer -apt install -y libb64-dev -# needed by reporter -apt install -y python3 python3-pip python3-dev -rm -f /usr/bin/python && ln -s /usr/bin/python3 /usr/bin/python -pip3 install --upgrade requests - -REPODIR=/data/inferenceserver/${REPO_VERSION} -rm -f *.log *.csv *.tjson *.json -rm -rf model_store - -RET=0 - -# Create model_store -MODEL_NAME="resnet50v1.5_fp16_savedmodel" -mkdir model_store -mkdir -p model_store/${MODEL_NAME} -cp -r ${REPODIR}/perf_model_store/${MODEL_NAME}/1/model.savedmodel model_store/${MODEL_NAME}/1 - -# Run server -tensorflow_model_server --port=8500 --model_name=${MODEL_NAME} --model_base_path=$PWD/model_store/${MODEL_NAME} > server.log 2>&1 & -SERVER_PID=$! -# Wait for the server to start -sleep 10 -if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start server\n***" - cat server.log - exit 1 -fi - -PERF_ANALYZER=/perf_bin/perf_analyzer -REPORTER=../common/reporter.py - -# To get the minimum latency -STATIC_BATCH=1 -NAME=${MODEL_NAME}_sbatch${STATIC_BATCH} - -# Run client -# To warmup the model -$PERF_ANALYZER -m ${MODEL_NAME} --service-kind tfserving -i grpc -b 1 -p 5000 -# Collect data -$PERF_ANALYZER -m ${MODEL_NAME} --service-kind tfserving -i grpc -b ${STATIC_BATCH} -p 5000 -f ${NAME}.csv >> ${NAME}.log 2>&1 -if (( $? != 0 )); then - RET=1 -fi - -echo -e "[{\"s_benchmark_kind\":\"benchmark_perf\"," >> ${NAME}.tjson -echo -e "\"s_benchmark_name\":\"resnet50\"," >> ${NAME}.tjson -echo -e "\"s_server\":\"tfserving\"," >> ${NAME}.tjson -echo -e "\"s_protocol\":\"grpc\"," >> ${NAME}.tjson -echo -e "\"s_framework\":\"savedmodel\"," >> ${NAME}.tjson -echo -e "\"s_model\":\"${MODEL_NAME}\"," >> ${NAME}.tjson -echo -e "\"l_concurrency\":1," >> ${NAME}.tjson -echo -e "\"l_batch_size\":1," >> ${NAME}.tjson -echo -e "\"l_instance_count\":1}]" >> ${NAME}.tjson - -if [ -f $REPORTER ]; then - set +e - - URL_FLAG= - if [ ! -z ${BENCHMARK_REPORTER_URL} ]; then - URL_FLAG="-u ${BENCHMARK_REPORTER_URL}" - fi - - $REPORTER -v -o ${NAME}.json --csv ${NAME}.csv ${URL_FLAG} ${NAME}.tjson - if (( $? != 0 )); then - RET=1 - fi - - set -e -fi - -# Large static batch size case. -STATIC_BATCH=128 -NAME=${MODEL_NAME}_sbatch${STATIC_BATCH} -$PERF_ANALYZER -m ${MODEL_NAME} --service-kind tfserving -i grpc -b ${STATIC_BATCH} -p 5000 -f ${NAME}.csv >> ${NAME}.log 2>&1 -if (( $? != 0 )); then - RET=1 -fi - -echo -e "[{\"s_benchmark_kind\":\"benchmark_perf\"," >> ${NAME}.tjson -echo -e "\"s_benchmark_name\":\"resnet50\"," >> ${NAME}.tjson -echo -e "\"s_server\":\"tfserving\"," >> ${NAME}.tjson -echo -e "\"s_protocol\":\"grpc\"," >> ${NAME}.tjson -echo -e "\"s_framework\":\"savedmodel\"," >> ${NAME}.tjson -echo -e "\"s_model\":\"${MODEL_NAME}\"," >> ${NAME}.tjson -echo -e "\"l_concurrency\":1," >> ${NAME}.tjson -echo -e "\"l_batch_size\":128," >> ${NAME}.tjson -echo -e "\"l_instance_count\":1}]" >> ${NAME}.tjson - -if [ -f $REPORTER ]; then - set +e - - URL_FLAG= - if [ ! -z ${BENCHMARK_REPORTER_URL} ]; then - URL_FLAG="-u ${BENCHMARK_REPORTER_URL}" - fi - - $REPORTER -v -o ${NAME}.json --csv ${NAME}.csv ${URL_FLAG} ${NAME}.tjson - if (( $? 
!= 0 )); then - RET=1 - fi - - set -e -fi - -if (( $RET == 0 )); then - echo -e "\n***\n*** Test Passed\n***" -else - echo -e "\n***\n*** Test FAILED\n***" -fi - -exit $RET diff --git a/qa/L0_perf_ts/test.sh b/qa/L0_perf_ts/test.sh deleted file mode 100755 index f308a43c1e..0000000000 --- a/qa/L0_perf_ts/test.sh +++ /dev/null @@ -1,124 +0,0 @@ -#!/bin/bash -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions -# are met: -# * Redistributions of source code must retain the above copyright -# notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and/or other materials provided with the distribution. -# * Neither the name of NVIDIA CORPORATION nor the names of its -# contributors may be used to endorse or promote products derived -# from this software without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY -# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR -# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR -# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY -# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE -# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - - -if [ "$#" -ge 1 ]; then - REPO_VERSION=$1 -fi -if [ -z "$REPO_VERSION" ]; then - echo -e "Repository version must be specified" - echo -e "\n***\n*** Test Failed\n***" - exit 1 -fi -if [ ! -z "$TEST_REPO_ARCH" ]; then - REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} -fi - -# TODO: DLIS-3777 following key update is required only while base image -# is not updated accordingly -apt-key del 7fa2af80 -apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/$(uname -m)/3bf863cc.pub - -apt update -apt install -y libb64-dev curl -apt install -y python3 python3-pip python3-dev -pip3 install --upgrade requests - -REPODIR=/data/inferenceserver/${REPO_VERSION} -PERF_ANALYZER=/perf_bin/perf_analyzer -REPORTER=../common/reporter.py - -rm -f *.log *.csv *.tjson *.json log4j.properties -rm -rf model_store -rm -rf serve - -RET=0 - -# Create model archive. 
Using default handler for image classification -MODEL_NAME="resnet50_fp32_libtorch" -mkdir model_store -torch-model-archiver --model-name resnet50 --version 1.0 --serialized-file ${REPODIR}/perf_model_store/${MODEL_NAME}/1/model.pt \ ---export-path model_store --handler image_classifier -f -# Suppressing the logging for better performance -echo "log4j.rootLogger = OFF" >> log4j.properties -# Run server -torchserve --start --ncs --model-store=model_store --models model_store/resnet50.mar --log-config log4j.properties - -sleep 5 - -# Get the input image to be used for generating requests -STATIC_BATCH=1 -curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/kitten_small.jpg -echo "{\"data\":[{\"TORCHSERVE_INPUT\" : [\"kitten_small.jpg\"]}]}" >> data.json -NAME=${MODEL_NAME}_sbatch${STATIC_BATCH} -PERF_ANALYZER_ARGS="-m resnet50 --service-kind torchserve -i http -u localhost:8080 -b ${STATIC_BATCH} -p 5000 --input-data data.json" - -# Run client -# To warmup the model -$PERF_ANALYZER ${PERF_ANALYZER_ARGS} -# Collect data -$PERF_ANALYZER ${PERF_ANALYZER_ARGS} -f ${NAME}.csv >> ${NAME}.log 2>&1 -if (( $? != 0 )); then - RET=1 -fi - -torchserve --stop - -echo -e "[{\"s_benchmark_kind\":\"benchmark_perf\"," >> ${NAME}.tjson -echo -e "\"s_benchmark_name\":\"preprocess+resnet50\"," >> ${NAME}.tjson -echo -e "\"s_server\":\"torchserve\"," >> ${NAME}.tjson -echo -e "\"s_protocol\":\"http\"," >> ${NAME}.tjson -echo -e "\"s_framework\":\"libtorch\"," >> ${NAME}.tjson -echo -e "\"s_model\":\"${MODEL_NAME}\"," >> ${NAME}.tjson -echo -e "\"l_concurrency\":1," >> ${NAME}.tjson -echo -e "\"l_batch_size\":1," >> ${NAME}.tjson -echo -e "\"l_instance_count\":1}]" >> ${NAME}.tjson - - -if [ -f $REPORTER ]; then - set +e - - URL_FLAG= - if [ ! -z ${BENCHMARK_REPORTER_URL} ]; then - URL_FLAG="-u ${BENCHMARK_REPORTER_URL}" - fi - - python $REPORTER -v -o ${NAME}.json --csv ${NAME}.csv ${URL_FLAG} ${NAME}.tjson - if (( $? != 0 )); then - RET=1 - fi - - set -e -fi - -if (( $RET == 0 )); then - echo -e "\n***\n*** Test Passed\n***" -else - echo -e "\n***\n*** Test FAILED\n***" -fi - -exit $RET diff --git a/qa/L0_perf_vllm/test.sh b/qa/L0_perf_vllm/test.sh new file mode 100755 index 0000000000..498f6f8e14 --- /dev/null +++ b/qa/L0_perf_vllm/test.sh @@ -0,0 +1,146 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
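The set -o pipefail guards added around the perf client runs in the L0_perf_nomodel and L0_perf_resnet scripts above matter because the measurement command is piped through tee; without pipefail, $? reports tee's exit status and a failed measurement would pass silently. A minimal illustration of the pattern, with a hypothetical measure_cmd standing in for the perf client invocation:

    set +e
    set -o pipefail
    measure_cmd 2>&1 | tee measurement.log   # hypothetical command piped to tee
    if [ $? -ne 0 ]; then                    # with pipefail, $? reflects measure_cmd, not tee
        echo -e "\n***\n*** FAILED Perf Analyzer measurement\n***"
        RET=1
    fi
    set +o pipefail
    set -e

The PA_MAX_TRIALS / --max-trials addition in the same hunks bounds how many measurement windows Perf Analyzer attempts before giving up on stability, so an unstable model fails fast instead of stalling the run.
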
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +source ../common/util.sh + +REPORTER=../common/reporter.py +TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} +SERVER=${TRITON_DIR}/bin/tritonserver +BACKEND_DIR=${TRITON_DIR}/backends +MODEL_REPO="${PWD}/models" +NAME="vllm_benchmarking_test" +MODEL_NAME="gpt2_vllm" +INPUT_DATA="./input_data.json" +SERVER_LOG="${NAME}_server.log" +SERVER_ARGS="--model-repository=${MODEL_REPO} --backend-directory=${BACKEND_DIR} --log-verbose=1" + +export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:=0} +EXPORT_FILE=profile-export-vllm-model.json + +pip3 install tritonclient nvidia-ml-py3 +rm -rf $MODEL_REPO $EXPORT_FILE *.tjson *.json *.csv + +mkdir -p $MODEL_REPO/$MODEL_NAME/1 +echo '{ + "model":"gpt2", + "disable_log_requests": "true", + "gpu_memory_utilization": 0.5 +}' >$MODEL_REPO/$MODEL_NAME/1/model.json + +echo 'backend: "vllm" +instance_group [ + { + count: 1 + kind: KIND_MODEL + } +]' >$MODEL_REPO/$MODEL_NAME/config.pbtxt + +echo '{ + "data": [ + { + "text_input": [ + "hi hi hi hi hi hi hi hi hi hi" + ], + "stream": [ + true + ], + "sampling_parameters": [ + "{\"max_tokens\": 1024, \"ignore_eos\": true}" + ] + } + ] +}' >$INPUT_DATA + +RET=0 +ARCH="amd64" +STATIC_BATCH=1 +INSTANCE_CNT=1 +CONCURRENCY=100 +MODEL_FRAMEWORK="vllm" +PERF_CLIENT_PROTOCOL="grpc" +PERF_CLIENT=perf_analyzer + +# Set stability-percentage 999 to bypass the stability check in PA. +# LLM generates a sequence of tokens that is unlikely to be within a reasonable bound to determine valid measurement in terms of latency. +# Using "count_windows" measurement mode, which automatically extends the window for collecting responses. +PERF_CLIENT_ARGS="-v -m $MODEL_NAME --concurrency-range=${CONCURRENCY} --measurement-mode=count_windows --measurement-request-count=10 \ + --input-data=$INPUT_DATA --profile-export-file=$EXPORT_FILE -i $PERF_CLIENT_PROTOCOL --async --streaming --stability-percentage=999" + +run_server +if (($SERVER_PID == 0)); then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +$PERF_CLIENT $PERF_CLIENT_ARGS -f ${NAME}.csv 2>&1 | tee ${NAME}_perf_analyzer.log +set +o pipefail +set -e + +if [[ -n "${SERVER_PID}" ]]; then + kill $SERVER_PID + wait $SERVER_PID +fi + +echo -e "[{\"s_benchmark_kind\":\"benchmark_perf\"," >>${NAME}.tjson +echo -e "\"s_benchmark_repo_branch\":\"${BENCHMARK_REPO_BRANCH}\"," >>${NAME}.tjson +echo -e "\"s_benchmark_name\":\"${NAME}\"," >>${NAME}.tjson +echo -e "\"s_server\":\"triton\"," >>${NAME}.tjson +echo -e "\"s_protocol\":\"${PERF_CLIENT_PROTOCOL}\"," >>${NAME}.tjson +echo -e "\"s_framework\":\"${MODEL_FRAMEWORK}\"," >>${NAME}.tjson +echo -e "\"s_model\":\"${MODEL_NAME}\"," >>${NAME}.tjson +echo -e "\"l_concurrency\":\"${CONCURRENCY}\"," >>${NAME}.tjson +echo -e "\"l_batch_size\":${STATIC_BATCH}," >>${NAME}.tjson +echo -e "\"l_instance_count\":${INSTANCE_CNT}," >>${NAME}.tjson +echo -e "\"s_architecture\":\"${ARCH}\"}]" >>${NAME}.tjson + +if [ -f $REPORTER ]; then + set +e + + URL_FLAG= + if [ ! 
-z ${BENCHMARK_REPORTER_URL} ]; then + URL_FLAG="-u ${BENCHMARK_REPORTER_URL}" + fi + + python3 $REPORTER -v -e ${EXPORT_FILE} -o ${NAME}.json --csv ${NAME}.csv --gpu-metrics --token-latency ${URL_FLAG} ${NAME}.tjson + if (($? != 0)); then + RET=1 + fi + + set -e +fi + +rm -rf $MODEL_REPO $INPUT_DATA + +if (($RET == 0)); then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_pinned_memory/test.sh b/qa/L0_pinned_memory/test.sh index 799c908b76..89b59d7c18 100755 --- a/qa/L0_pinned_memory/test.sh +++ b/qa/L0_pinned_memory/test.sh @@ -50,7 +50,7 @@ source ../common/util.sh # Select the single GPU that will be available to the inference server export CUDA_VISIBLE_DEVICES=0 -rm -f *.log *.serverlog *.csv *.metrics +rm -f *.log *.csv *.metrics RET=0 rm -fr ./custom_models && mkdir ./custom_models && \ @@ -81,7 +81,7 @@ for BACKEND in $BACKENDS; do # With pinned memory SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1" - SERVER_LOG="${ENSEMBLE_NAME}.pinned.serverlog" + SERVER_LOG="${ENSEMBLE_NAME}.pinned.server.log" run_server if (( $SERVER_PID == 0 )); then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -96,7 +96,7 @@ for BACKEND in $BACKENDS; do RET=1 fi - grep "] non-pinned" ${ENSEMBLE_NAME}.pinned.serverlog + grep "] non-pinned" ${ENSEMBLE_NAME}.pinned.server.log if [ $? -eq 0 ]; then echo -e "\n***\n*** Failed. Expected only pinned memory is allocated\n***" RET=1 @@ -108,7 +108,7 @@ for BACKEND in $BACKENDS; do # Restart the server without verbose logging SERVER_ARGS="--model-repository=`pwd`/models" - SERVER_LOG="${ENSEMBLE_NAME}.pinned.serverlog" + SERVER_LOG="${ENSEMBLE_NAME}.pinned.server.log" run_server if (( $SERVER_PID == 0 )); then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -133,7 +133,7 @@ for BACKEND in $BACKENDS; do # Without pinned memory SERVER_ARGS="--model-repository=`pwd`/models --pinned-memory-pool-byte-size=0 --log-verbose=1" - SERVER_LOG="${ENSEMBLE_NAME}.nonpinned.serverlog" + SERVER_LOG="${ENSEMBLE_NAME}.nonpinned.server.log" run_server if (( $SERVER_PID == 0 )); then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -148,7 +148,7 @@ for BACKEND in $BACKENDS; do RET=1 fi - grep "] pinned" ${ENSEMBLE_NAME}.nonpinned.serverlog + grep "] pinned" ${ENSEMBLE_NAME}.nonpinned.server.log if [ $? -eq 0 ]; then echo -e "\n***\n*** Failed. Expected only non-pinned memory is allocated\n***" RET=1 @@ -160,7 +160,7 @@ for BACKEND in $BACKENDS; do # Restart the server without verbose logging SERVER_ARGS="--model-repository=`pwd`/models --pinned-memory-pool-byte-size=0" - SERVER_LOG="${ENSEMBLE_NAME}.nonpinned.serverlog" + SERVER_LOG="${ENSEMBLE_NAME}.nonpinned.server.log" run_server if (( $SERVER_PID == 0 )); then echo -e "\n***\n*** Failed to start $SERVER\n***" diff --git a/qa/L0_python_api/test.sh b/qa/L0_python_api/test.sh new file mode 100755 index 0000000000..c5021acae0 --- /dev/null +++ b/qa/L0_python_api/test.sh @@ -0,0 +1,50 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +TEST_LOG="./python_binding.log" + +RET=0 + +rm -f $TEST_LOG + +set +e + +python test_binding.py > $TEST_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +set -e + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + cat $TEST_LOG + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_jetson_example/test.sh b/qa/L0_python_client_unit_tests/test.sh old mode 100644 new mode 100755 similarity index 57% rename from qa/L0_jetson_example/test.sh rename to qa/L0_python_client_unit_tests/test.sh index 4d692a8b0a..5a46ecccc5 --- a/qa/L0_jetson_example/test.sh +++ b/qa/L0_python_client_unit_tests/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,39 +25,30 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
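For readability, the L0_perf_vllm measurement above, with PERF_CLIENT_ARGS expanded, comes down to roughly this single invocation (model name, input file, and other values taken from the variables set in that test):

    perf_analyzer -v -m gpt2_vllm \
        --concurrency-range=100 \
        --measurement-mode=count_windows --measurement-request-count=10 \
        --input-data=./input_data.json \
        --profile-export-file=profile-export-vllm-model.json \
        -i grpc --async --streaming \
        --stability-percentage=999 \
        -f vllm_benchmarking_test.csv 2>&1 | tee vllm_benchmarking_test_perf_analyzer.log
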
-wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tao/peoplenet/versions/pruned_v2.1/zip -O pruned_v2.1.zip -unzip pruned_v2.1.zip -d concurrency_and_dynamic_batching/tao/models/peoplenet && rm pruned_v2.1.zip +TEST_LOG="./python_client_unit_tests.log" +PYTHON_CLIENT_UNIT_TESTS_DIR=/opt/tritonserver/qa/python_client_unit_tests/ +PYTHON_CLIENT_UNIT_TESTS_CMD="python3 -m unittest discover -v -s $PYTHON_CLIENT_UNIT_TESTS_DIR -t $PYTHON_CLIENT_UNIT_TESTS_DIR" -# Use TAO convertor for JP4.6 -wget --content-disposition https://developer.nvidia.com/jp46-20210820t231431z-001zip -O jp4.6-20210820T231431Z-001.zip -unzip jp4.6-20210820T231431Z-001.zip && rm jp4.6-20210820T231431Z-001.zip +# DLPack test requires Torch to validate GPU tensor +pip3 install torch -cp tao-converter-jp46-trt8.0.1.6/tao-converter concurrency_and_dynamic_batching/tao/tao-converter && rm -rf jp4.6 -chmod 777 concurrency_and_dynamic_batching/tao/tao-converter +RET=0 -(cd concurrency_and_dynamic_batching/tao && bash convert_peoplenet.sh) +rm -f $TEST_LOG -# Build the example and make sure permissions -cd concurrency_and_dynamic_batching && make +set +e -CLIENT_LOG="./client.log" - -# Running the example/s -./people_detection -m gpu -v -r trtis_model_repo_sample_1 -t 6 -s false -p ${HOME}/tritonserver >> ${CLIENT_LOG}.1 2>&1 -if [ $? -ne 0 ]; then - cat $CLIENT_LOG.1 - RET=1 -fi - -./people_detection -m gpu -v -r trtis_model_repo_sample_2 -t 6 -s false -p ${HOME}/tritonserver >> ${CLIENT_LOG}.2 2>&1 +$PYTHON_CLIENT_UNIT_TESTS_CMD > $TEST_LOG 2>&1 if [ $? -ne 0 ]; then - cat $CLIENT_LOG.2 + echo -e "\n***\n*** Test Failed\n***" RET=1 fi +set -e if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else + cat $TEST_LOG echo -e "\n***\n*** Test FAILED\n***" fi diff --git a/qa/L0_query/query_e2e.py b/qa/L0_query/query_e2e.py old mode 100644 new mode 100755 index 69849749d4..048a4a8d41 --- a/qa/L0_query/query_e2e.py +++ b/qa/L0_query/query_e2e.py @@ -1,5 +1,5 @@ #!/usr/bin/env python -# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
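Several test scripts in this change flip from mode 100644 to 100755 and gain a shebang so they can be invoked directly rather than only through an interpreter. The equivalent local edit, shown here for one of the affected files, is simply:

    chmod +x qa/L0_query/query_e2e.py      # recorded by git as mode 100644 -> 100755
    head -n1 qa/L0_query/query_e2e.py      # should print the interpreter line, e.g. #!/usr/bin/env python
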
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -27,26 +27,23 @@ import sys -sys.path.append('../common') +sys.path.append("../common") + +import unittest -import argparse import numpy as np -import os -from builtins import range -import tritonclient.http as tritonhttpclient +import test_util as tu import tritonclient.grpc as tritongrpcclient +import tritonclient.http as tritonhttpclient from tritonclient.utils import InferenceServerException from tritonclient.utils import cuda_shared_memory as cudashm -import unittest -import test_util as tu class QueryTest(tu.TestResultCollector): - def test_http(self): triton_client = tritonhttpclient.InferenceServerClient("localhost:8000") inputs = [] - inputs.append(tritonhttpclient.InferInput('INPUT', [1], "UINT8")) + inputs.append(tritonhttpclient.InferInput("INPUT", [1], "UINT8")) inputs[0].set_data_from_numpy(np.arange(1, dtype=np.uint8)) try: @@ -59,33 +56,33 @@ def test_http(self): def test_http_shared_memory(self): triton_client = tritonhttpclient.InferenceServerClient("localhost:8000") inputs = [] - inputs.append(tritonhttpclient.InferInput('INPUT', [1], "UINT8")) + inputs.append(tritonhttpclient.InferInput("INPUT", [1], "UINT8")) inputs[0].set_data_from_numpy(np.arange(1, dtype=np.uint8)) # Set up CUDA shared memory for outputs triton_client.unregister_system_shared_memory() triton_client.unregister_cuda_shared_memory() - shm_op0_handle = cudashm.create_shared_memory_region( - "output0_data", 4, 0) - shm_op1_handle = cudashm.create_shared_memory_region( - "output1_data", 4, 0) + shm_op0_handle = cudashm.create_shared_memory_region("output0_data", 4, 0) + shm_op1_handle = cudashm.create_shared_memory_region("output1_data", 4, 0) triton_client.register_cuda_shared_memory( - "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 4) + "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 4 + ) triton_client.register_cuda_shared_memory( - "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 4) + "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 4 + ) outputs = [] outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs[-1].set_shared_memory("output0_data", 4) outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT1', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True) + ) outputs[-1].set_shared_memory("output1_data", 4) try: - triton_client.infer(model_name="query", - inputs=inputs, - outputs=outputs) + triton_client.infer(model_name="query", inputs=inputs, outputs=outputs) self.assertTrue(False, "expect error with query information") except InferenceServerException as ex: self.assertTrue("OUTPUT0 GPU 0" in ex.message()) @@ -99,34 +96,34 @@ def test_http_shared_memory(self): def test_http_out_of_shared_memory(self): triton_client = tritonhttpclient.InferenceServerClient("localhost:8000") inputs = [] - inputs.append(tritonhttpclient.InferInput('INPUT', [1], "UINT8")) + inputs.append(tritonhttpclient.InferInput("INPUT", [1], "UINT8")) inputs[0].set_data_from_numpy(np.arange(1, dtype=np.uint8)) # Set up too small CUDA shared memory for outputs, expect query # returns default value triton_client.unregister_system_shared_memory() triton_client.unregister_cuda_shared_memory() - shm_op0_handle = cudashm.create_shared_memory_region( - "output0_data", 1, 0) - shm_op1_handle = 
cudashm.create_shared_memory_region( - "output1_data", 1, 0) + shm_op0_handle = cudashm.create_shared_memory_region("output0_data", 1, 0) + shm_op1_handle = cudashm.create_shared_memory_region("output1_data", 1, 0) triton_client.register_cuda_shared_memory( - "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 1) + "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 1 + ) triton_client.register_cuda_shared_memory( - "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 1) + "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 1 + ) outputs = [] outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs[-1].set_shared_memory("output0_data", 1) outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT1', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True) + ) outputs[-1].set_shared_memory("output1_data", 1) try: - triton_client.infer(model_name="query", - inputs=inputs, - outputs=outputs) + triton_client.infer(model_name="query", inputs=inputs, outputs=outputs) self.assertTrue(False, "expect error with query information") except InferenceServerException as ex: self.assertTrue("OUTPUT0 CPU 0" in ex.message()) @@ -140,7 +137,7 @@ def test_http_out_of_shared_memory(self): def test_grpc(self): triton_client = tritongrpcclient.InferenceServerClient("localhost:8001") inputs = [] - inputs.append(tritongrpcclient.InferInput('INPUT', [1], "UINT8")) + inputs.append(tritongrpcclient.InferInput("INPUT", [1], "UINT8")) inputs[0].set_data_from_numpy(np.arange(1, dtype=np.uint8)) try: @@ -153,31 +150,29 @@ def test_grpc(self): def test_grpc_shared_memory(self): triton_client = tritongrpcclient.InferenceServerClient("localhost:8001") inputs = [] - inputs.append(tritongrpcclient.InferInput('INPUT', [1], "UINT8")) + inputs.append(tritongrpcclient.InferInput("INPUT", [1], "UINT8")) inputs[0].set_data_from_numpy(np.arange(1, dtype=np.uint8)) # Set up CUDA shared memory for outputs triton_client.unregister_system_shared_memory() triton_client.unregister_cuda_shared_memory() - shm_op0_handle = cudashm.create_shared_memory_region( - "output0_data", 4, 0) - shm_op1_handle = cudashm.create_shared_memory_region( - "output1_data", 4, 0) + shm_op0_handle = cudashm.create_shared_memory_region("output0_data", 4, 0) + shm_op1_handle = cudashm.create_shared_memory_region("output1_data", 4, 0) triton_client.register_cuda_shared_memory( - "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 4) + "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 4 + ) triton_client.register_cuda_shared_memory( - "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 4) + "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 4 + ) outputs = [] - outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT0')) + outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT0")) outputs[-1].set_shared_memory("output0_data", 4) - outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT1')) + outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT1")) outputs[-1].set_shared_memory("output1_data", 4) try: - triton_client.infer(model_name="query", - inputs=inputs, - outputs=outputs) + triton_client.infer(model_name="query", inputs=inputs, outputs=outputs) self.assertTrue(False, "expect error with query information") except InferenceServerException as ex: self.assertTrue("OUTPUT0 GPU 0" in ex.message()) @@ -191,32 +186,30 @@ def 
test_grpc_shared_memory(self): def test_grpc_out_of_shared_memory(self): triton_client = tritongrpcclient.InferenceServerClient("localhost:8001") inputs = [] - inputs.append(tritongrpcclient.InferInput('INPUT', [1], "UINT8")) + inputs.append(tritongrpcclient.InferInput("INPUT", [1], "UINT8")) inputs[0].set_data_from_numpy(np.arange(1, dtype=np.uint8)) # Set up too small CUDA shared memory for outputs, expect query # returns default value triton_client.unregister_system_shared_memory() triton_client.unregister_cuda_shared_memory() - shm_op0_handle = cudashm.create_shared_memory_region( - "output0_data", 1, 0) - shm_op1_handle = cudashm.create_shared_memory_region( - "output1_data", 1, 0) + shm_op0_handle = cudashm.create_shared_memory_region("output0_data", 1, 0) + shm_op1_handle = cudashm.create_shared_memory_region("output1_data", 1, 0) triton_client.register_cuda_shared_memory( - "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 1) + "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 1 + ) triton_client.register_cuda_shared_memory( - "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 1) + "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 1 + ) outputs = [] - outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT0')) + outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT0")) outputs[-1].set_shared_memory("output0_data", 1) - outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT1')) + outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT1")) outputs[-1].set_shared_memory("output1_data", 1) try: - triton_client.infer(model_name="query", - inputs=inputs, - outputs=outputs) + triton_client.infer(model_name="query", inputs=inputs, outputs=outputs) self.assertTrue(False, "expect error with query information") except InferenceServerException as ex: self.assertTrue("OUTPUT0 CPU 0" in ex.message()) @@ -228,5 +221,5 @@ def test_grpc_out_of_shared_memory(self): triton_client.unregister_cuda_shared_memory() -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_query/test.sh b/qa/L0_query/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_rate_limiter/rate_limiter_test.py b/qa/L0_rate_limiter/rate_limiter_test.py old mode 100644 new mode 100755 index c02c50b61e..4bc7b82e70 --- a/qa/L0_rate_limiter/rate_limiter_test.py +++ b/qa/L0_rate_limiter/rate_limiter_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -29,11 +31,12 @@ sys.path.append("../common") import functools -import numpy as np import os -import unittest import threading import time +import unittest + +import numpy as np import sequence_util as su import tritongrpcclient as grpcclient from tritonclientutils import * @@ -46,7 +49,6 @@ class AsyncGrpcRunner: - def __init__(self, tester, server_url, model_name, delay_ms): self._tester = tester self._server_url = server_url @@ -79,18 +81,17 @@ def req_loop(self): client = grpcclient.InferenceServerClient(self._server_url) inputs = [ - grpcclient.InferInput("INPUT0", self._shape, - np_to_triton_dtype(self._dtype)) + grpcclient.InferInput( + "INPUT0", self._shape, np_to_triton_dtype(self._dtype) + ) ] self._inflight_requests = 0 - start_stat = client.get_inference_statistics( - model_name=self._model_name) + start_stat = client.get_inference_statistics(model_name=self._model_name) global _exit_signal while not _exit_signal: - input_numpy = np.random.random_sample(self._shape).astype( - self._dtype) + input_numpy = np.random.random_sample(self._shape).astype(self._dtype) inputs[0].set_data_from_numpy(input_numpy) self._input_data.append(input_numpy) @@ -99,12 +100,15 @@ def req_loop(self): def _check_can_send(): return self._inflight_requests < _inference_concurrency - can_send = self._sync.wait_for(_check_can_send, - timeout=_response_wait_time_s) + can_send = self._sync.wait_for( + _check_can_send, timeout=_response_wait_time_s + ) self._tester.assertTrue( can_send, "client didn't receive a response within {}s".format( - _response_wait_time_s)) + _response_wait_time_s + ), + ) callback = functools.partial(AsyncGrpcRunner._on_result, self) client.async_infer( @@ -115,7 +119,7 @@ def _check_can_send(): ) self._inflight_requests += 1 self._num_sent_request += 1 - if (self._num_sent_request == _inference_count): + if self._num_sent_request == _inference_count: _exit_signal = True time.sleep(self._delay_ms / 1000.0) @@ -125,17 +129,21 @@ def _check_can_send(): def _all_processed(): return self._inflight_requests == 0 - self._processed_all = self._sync.wait_for(_all_processed, - _finish_wait_time_s) + self._processed_all = self._sync.wait_for( + _all_processed, _finish_wait_time_s + ) self._tester.assertTrue( self._processed_all, - "the processing didn't complete even after waiting for {}s". 
- format(_finish_wait_time_s)) + "the processing didn't complete even after waiting for {}s".format( + _finish_wait_time_s + ), + ) end_stat = client.get_inference_statistics(model_name=self._model_name) - self._processed_request_count = end_stat.model_stats[ - 0].inference_stats.success.count - start_stat.model_stats[ - 0].inference_stats.success.count + self._processed_request_count = ( + end_stat.model_stats[0].inference_stats.success.count + - start_stat.model_stats[0].inference_stats.success.count + ) def start(self): self._req_thread.start() @@ -144,13 +152,15 @@ def _validate_run(self): if len(self._errors) != 0: raise self._errors[0] self._tester.assertEqual( - len(self._input_data), len(self._results.keys()), - "the number of inputs and output should match") + len(self._input_data), + len(self._results.keys()), + "the number of inputs and output should match", + ) for i in range(len(self._input_data)): self._tester.assertFalse( - (self._input_data[i] != - self._results[i].as_numpy('OUTPUT0')).any(), - "the output data should match with the input data") + (self._input_data[i] != self._results[i].as_numpy("OUTPUT0")).any(), + "the output data should match with the input data", + ) def join(self): self._req_thread.join() @@ -158,17 +168,16 @@ def join(self): class RateLimiterTest(su.SequenceBatcherTestUtil): - def stress_models(self, model_names, delay_ms=0): infer_counts = {} try: runners = [] for model_name in model_names: runners.append( - AsyncGrpcRunner(self, - "localhost:8001", - model_name, - delay_ms=delay_ms)) + AsyncGrpcRunner( + self, "localhost:8001", model_name, delay_ms=delay_ms + ) + ) for r in runners: r.start() for r in runners: @@ -191,7 +200,7 @@ def test_single_model(self): def test_cross_model_prioritization_limited_resource(self): # Sends requests to two models, one operating at # priority of 1 and other at 2 respectively. - # The availabe resource counts doesn't allow models + # The available resource counts doesn't allow models # to execute simultaneously. model_names = ["custom_zero_1_float32", "custom_zero_1_float32_v2"] @@ -199,32 +208,36 @@ def test_cross_model_prioritization_limited_resource(self): # TODO: Validate the priority and resource counts are set correctly infer_counts = self.stress_models(model_names) - infer_ratio = infer_counts[model_names[0]] / float( - infer_counts[model_names[1]]) + infer_ratio = infer_counts[model_names[0]] / float(infer_counts[model_names[1]]) self.assertGreater( - infer_ratio, 1.80, + infer_ratio, + 1.80, "Got infer ratio across models {}, expected closer to 2".format( - infer_ratio)) + infer_ratio + ), + ) def test_cross_model_prioritization_plenty_resource(self): # Sends requests to two models, one operating at # priority of 1 and other at 2 respectively. - # The availabe resource counts wll allow both models - # to run simulataneously. + # The available resource counts wll allow both models + # to run simultaneously. model_names = ["custom_zero_1_float32", "custom_zero_1_float32_v2"] # TODO: Validate the priority and resource counts are set correctly infer_counts = self.stress_models(model_names) - infer_diff = abs(infer_counts[model_names[0]] - - infer_counts[model_names[1]]) + infer_diff = abs(infer_counts[model_names[0]] - infer_counts[model_names[1]]) self.assertGreater( - 10, infer_diff, - "Got infer difference between models {}, expected closer to 0". 
- format(infer_diff)) + 10, + infer_diff, + "Got infer difference between models {}, expected closer to 0".format( + infer_diff + ), + ) def test_single_model_dynamic_batching(self): # Send all the inference requests with a delay to a model @@ -242,18 +255,25 @@ def test_single_model_dynamic_batching(self): batch_stats = stats.model_stats[0].batch_stats self.assertEqual( - len(batch_stats), 1, - "expected single batch-size, got {}".format(len(batch_stats))) + len(batch_stats), + 1, + "expected single batch-size, got {}".format(len(batch_stats)), + ) for batch_stat in batch_stats: self.assertEqual( - batch_stat.batch_size, 4, - "unexpected batch-size {}".format(batch_stat.batch_size)) + batch_stat.batch_size, + 4, + "unexpected batch-size {}".format(batch_stat.batch_size), + ) # Get count from one of the stats self.assertEqual( - batch_stat.compute_infer.count, _inference_count / 4, - "expected model-execution-count {} for batch size {}, got {}". - format(_inference_count / 4, 4, batch_stat.compute_infer.count)) + batch_stat.compute_infer.count, + _inference_count / 4, + "expected model-execution-count {} for batch size {}, got {}".format( + _inference_count / 4, 4, batch_stat.compute_infer.count + ), + ) def test_single_model_sequence_batching(self): # Send one sequence and check for correct accumulator @@ -265,19 +285,26 @@ def test_single_model_sequence_batching(self): model_name = "custom_sequence_int32" self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.check_sequence( - 'custom', + "custom", model_name, np.int32, 5, (4000, None), # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay)) - (("start", 1, None, None), (None, 2, None, None), - (None, 3, None, None), (None, 4, None, None), - (None, 5, None, None), (None, 6, None, None), - (None, 7, None, None), (None, 8, None, None), - ("end", 9, None, None)), + ( + ("start", 1, None, None), + (None, 2, None, None), + (None, 3, None, None), + (None, 4, None, None), + (None, 5, None, None), + (None, 6, None, None), + (None, 7, None, None), + (None, 8, None, None), + ("end", 9, None, None), + ), 45, - 'grpc') + "grpc", + ) self.check_deferred_exception() self.check_status(model_name, {1: 9}, 9, 9) @@ -285,5 +312,5 @@ def test_single_model_sequence_batching(self): self.assertTrue(False, "unexpected error {}".format(ex)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_rate_limiter/test.sh b/qa/L0_rate_limiter/test.sh old mode 100644 new mode 100755 index 9a23822056..334af99e4c --- a/qa/L0_rate_limiter/test.sh +++ b/qa/L0_rate_limiter/test.sh @@ -102,12 +102,17 @@ if [ "$SERVER_PID" != "0" ]; then kill $SERVER_PID wait $SERVER_PID fi + +set +e grep "Resource count for \"resource1\" is limited to 1 which will prevent scheduling of one or more model instances, the minimum required count is 4" $SERVER_LOG if [ $? -ne 0 ]; then + cat $SERVER_LOG echo -e "\n***\n*** Failed. Expected error message while loading the model \"custom_zero_1_float32\"\n***" RET=1 fi +set -e + # Case2: resources sufficient only for one model SERVER_ARGS="--rate-limit=execution_count --rate-limit-resource=resource1:3 --rate-limit-resource=resource2:2 --model-repository=$MODELDIR/custom_models" SERVER_LOG="./inference_server_r3.log" @@ -119,12 +124,17 @@ if [ "$SERVER_PID" != "0" ]; then kill $SERVER_PID wait $SERVER_PID fi + +set +e grep "Resource count for \"resource1\" is limited to 3 which will prevent scheduling of one or more model instances, the minimum required count is 4" $SERVER_LOG if [ $? 
-ne 0 ]; then + cat $SERVER_LOG echo -e "\n***\n*** Failed. Expected error message while loading the model \"custom_zero_1_float32\"\n***" RET=1 fi +set -e + # Case3: Resource specified only for specific device id 10 and not for the GPU that loads the model instance. SERVER_ARGS="--rate-limit=execution_count --rate-limit-resource=resource1:10:10 --rate-limit-resource=resource2:2 --model-repository=$MODELDIR/custom_models" SERVER_LOG="./inference_server_rdevice.log" @@ -136,12 +146,17 @@ if [ "$SERVER_PID" != "0" ]; then kill $SERVER_PID wait $SERVER_PID fi + +set +e grep "Resource count for \"resource1\" is limited to 0 which will prevent scheduling of one or more model instances, the minimum required count is 4" $SERVER_LOG if [ $? -ne 0 ]; then + cat $SERVER_LOG echo -e "\n***\n*** Failed. Expected error message while loading the model \"custom_zero_1_float32\"\n***" RET=1 fi +set -e + # Case4: Conflicting resource types in the config cp -r ./custom_models/custom_zero_1_float32_v2 ./custom_models/custom_zero_1_float32_v3 (cd custom_models/custom_zero_1_float32_v3 && \ @@ -158,13 +173,18 @@ if [ "$SERVER_PID" != "0" ]; then kill $SERVER_PID wait $SERVER_PID fi + +set +e grep "Resource \"resource2\" is present as both global and device-specific resource in the model configuration." $SERVER_LOG if [ $? -ne 0 ]; then + cat $SERVER_LOG echo -e "\n***\n*** Failed. Expected error message for conflicting resource types\n***" RET=1 fi rm -rf ./custom_models/custom_zero_1_float32_v3 +set -e + ## ## Tests with cross-model prioritization with various cases: ## @@ -258,7 +278,7 @@ kill $SERVER_PID wait $SERVER_PID ## -## Tests with mulitple instances of the same model +## Tests with multiple instances of the same model ## # Replace the second model with a second instance with same resource requirements and priority. # TODO: Currently there is no way to check which instance got to run inferences hence we only diff --git a/qa/L0_register/test.sh b/qa/L0_register/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_repoagent_checksum/identity_test.py b/qa/L0_repoagent_checksum/identity_test.py old mode 100644 new mode 100755 index ad9f268967..4db55e0d45 --- a/qa/L0_repoagent_checksum/identity_test.py +++ b/qa/L0_repoagent_checksum/identity_test.py @@ -27,40 +27,43 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse -import numpy as np import sys + +import numpy as np import tritongrpcclient as grpcclient import tritonhttpclient as httpclient from tritonclientutils import np_to_triton_dtype FLAGS = None -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-v', - '--verbose', - action="store_true", - required=False, - default=False, - help='Enable verbose output') - parser.add_argument('-u', - '--url', - type=str, - required=False, - help='Inference server URL.') parser.add_argument( - '-i', - '--protocol', + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-u", "--url", type=str, required=False, help="Inference server URL." + ) + parser.add_argument( + "-i", + "--protocol", type=str, required=False, - default='http', - help='Protocol ("http"/"grpc") used to ' + - 'communicate with inference service. Default is "http".') + default="http", + help='Protocol ("http"/"grpc") used to ' + + 'communicate with inference service. 
Default is "http".', + ) FLAGS = parser.parse_args() if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"): - print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format( - FLAGS.protocol)) + print( + 'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol) + ) exit(1) client_util = httpclient if FLAGS.protocol == "http" else grpcclient @@ -69,23 +72,23 @@ FLAGS.url = "localhost:8000" if FLAGS.protocol == "http" else "localhost:8001" # Reuse a single client for all sync tests - with client_util.InferenceServerClient(FLAGS.url, - verbose=FLAGS.verbose) as client: + with client_util.InferenceServerClient(FLAGS.url, verbose=FLAGS.verbose) as client: for model_name, np_dtype, shape in ( - # yapf: disable + # yapf: disable ("identity_int32", np.int32, [0]), - ("identity_int32", np.int32, [7])): + ("identity_int32", np.int32, [7]) + ): # yapf: enable if np_dtype != object: input_data = (16384 * np.random.randn(*shape)).astype(np_dtype) else: - in0 = (16384 * np.ones(shape, dtype='int')) - in0n = np.array([str(x) for x in in0.reshape(in0.size)], - dtype=object) + in0 = 16384 * np.ones(shape, dtype="int") + in0n = np.array([str(x) for x in in0.reshape(in0.size)], dtype=object) input_data = in0n.reshape(in0.shape) inputs = [ - client_util.InferInput("INPUT0", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + client_util.InferInput( + "INPUT0", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) @@ -102,6 +105,9 @@ output_data = np.char.decode(output_data) if not np.array_equal(output_data, input_data): - print("error: expected output {} to match input {}".format( - output_data, input_data)) + print( + "error: expected output {} to match input {}".format( + output_data, input_data + ) + ) sys.exit(1) diff --git a/qa/L0_request_cancellation/grpc_cancellation_test.py b/qa/L0_request_cancellation/grpc_cancellation_test.py new file mode 100755 index 0000000000..fadaa291e8 --- /dev/null +++ b/qa/L0_request_cancellation/grpc_cancellation_test.py @@ -0,0 +1,141 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import asyncio +import queue +import time +import unittest +from functools import partial + +import numpy as np +import tritonclient.grpc as grpcclient +import tritonclient.grpc.aio as grpcclientaio +from tritonclient.utils import InferenceServerException + + +class UserData: + def __init__(self): + self._completed_requests = queue.Queue() + + +def callback(user_data, result, error): + if error: + user_data._completed_requests.put(error) + else: + user_data._completed_requests.put(result) + + +class GrpcCancellationTest(unittest.IsolatedAsyncioTestCase): + _model_name = "custom_identity_int32" + _model_delay = 10.0 # seconds + _grpc_params = {"url": "localhost:8001", "verbose": True} + + def setUp(self): + self._client = grpcclient.InferenceServerClient(**self._grpc_params) + self._client_aio = grpcclientaio.InferenceServerClient(**self._grpc_params) + self._user_data = UserData() + self._callback = partial(callback, self._user_data) + self._prepare_request() + self._start_time = time.time() # seconds + + def tearDown(self): + self._end_time = time.time() # seconds + self._assert_max_duration() + + def _prepare_request(self): + self._inputs = [] + self._inputs.append(grpcclient.InferInput("INPUT0", [1, 1], "INT32")) + self._outputs = [] + self._outputs.append(grpcclient.InferRequestedOutput("OUTPUT0")) + self._inputs[0].set_data_from_numpy(np.array([[10]], dtype=np.int32)) + + def _assert_max_duration(self): + max_duration = self._model_delay * 0.5 # seconds + duration = self._end_time - self._start_time # seconds + self.assertLess( + duration, + max_duration, + f"test runtime expected less than {max_duration}s response time, got {duration}s", + ) + + def _assert_callback_cancelled(self): + self.assertFalse(self._user_data._completed_requests.empty()) + data_item = self._user_data._completed_requests.get() + self.assertIsInstance(data_item, InferenceServerException) + self.assertIn("Locally cancelled by application!", str(data_item)) + + def test_grpc_async_infer(self): + future = self._client.async_infer( + model_name=self._model_name, + inputs=self._inputs, + callback=self._callback, + outputs=self._outputs, + ) + time.sleep(2) # ensure the inference has started + future.cancel() + time.sleep(0.1) # context switch + self._assert_callback_cancelled() + + def test_grpc_stream_infer(self): + self._client.start_stream(callback=self._callback) + self._client.async_stream_infer( + model_name=self._model_name, inputs=self._inputs, outputs=self._outputs + ) + time.sleep(2) # ensure the inference has started + self._client.stop_stream(cancel_requests=True) + self._assert_callback_cancelled() + + async def test_aio_grpc_async_infer(self): + infer_task = asyncio.create_task( + self._client_aio.infer( + model_name=self._model_name, inputs=self._inputs, outputs=self._outputs + ) + ) + await asyncio.sleep(2) # ensure the inference has started + infer_task.cancel() + with self.assertRaises(asyncio.CancelledError): + await infer_task + + async def 
test_aio_grpc_stream_infer(self): + async def requests_generator(): + yield { + "model_name": self._model_name, + "inputs": self._inputs, + "outputs": self._outputs, + } + + responses_iterator = self._client_aio.stream_infer(requests_generator()) + await asyncio.sleep(2) # ensure the inference has started + self.assertTrue(responses_iterator.cancel()) + with self.assertRaises(asyncio.CancelledError): + async for result, error in responses_iterator: + self._callback(result, error) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_request_cancellation/scheduler_test.py b/qa/L0_request_cancellation/scheduler_test.py new file mode 100755 index 0000000000..a6cd97efaa --- /dev/null +++ b/qa/L0_request_cancellation/scheduler_test.py @@ -0,0 +1,233 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
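For reference, the client-side pattern exercised by the new grpc_cancellation_test.py above is simply: issue the request through the non-blocking API, cancel the handle it returns, and verify that the registered callback receives a "Locally cancelled by application!" error. Below is a minimal sketch of that flow, not part of the patch, assuming a server on localhost:8001 serving the delayed custom_identity_int32 model that the accompanying test.sh builds; the blocking queue.get() replaces the fixed sleeps used by the test.

    import queue
    import time
    from functools import partial

    import numpy as np
    import tritonclient.grpc as grpcclient
    from tritonclient.utils import InferenceServerException

    def callback(responses, result, error):
        # A cancelled request is reported to the callback as an error.
        responses.put(error if error is not None else result)

    client = grpcclient.InferenceServerClient("localhost:8001")
    inputs = [grpcclient.InferInput("INPUT0", [1, 1], "INT32")]
    inputs[0].set_data_from_numpy(np.array([[10]], dtype=np.int32))

    responses = queue.Queue()
    future = client.async_infer(
        model_name="custom_identity_int32",
        inputs=inputs,
        callback=partial(callback, responses),
    )
    time.sleep(2)  # give the (deliberately slow) model time to start executing
    future.cancel()  # request cancellation of the in-flight inference

    error = responses.get()  # blocks until the callback fires with the cancellation
    assert isinstance(error, InferenceServerException)
    assert "Locally cancelled by application!" in str(error)

For streaming requests the equivalent is stop_stream(cancel_requests=True), and with the asyncio client cancelling the asyncio task (or the response iterator) raises asyncio.CancelledError, as the individual test cases above demonstrate.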
+ +import concurrent.futures +import time +import unittest + +import numpy as np +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException + + +class TestScheduler(unittest.TestCase): + def setUp(self): + # Initialize client + self._triton = grpcclient.InferenceServerClient("localhost:8001") + + def _get_inputs(self, batch_size): + self.assertIsInstance(batch_size, int) + self.assertGreater(batch_size, 0) + shape = [batch_size, 8] + inputs = [grpcclient.InferInput("INPUT0", shape, "FP32")] + inputs[0].set_data_from_numpy(np.ones(shape, dtype=np.float32)) + return inputs + + def _generate_callback_and_response_pair(self): + response = {"responded": False, "result": None, "error": None} + + def callback(result, error): + response["responded"] = True + response["result"] = result + response["error"] = error + + return callback, response + + def _assert_response_is_cancelled(self, response): + self.assertTrue(response["responded"]) + self.assertEqual(response["result"], None) + self.assertIsInstance(response["error"], InferenceServerException) + self.assertEqual(response["error"].status(), "StatusCode.CANCELLED") + + def _generate_streaming_callback_and_response_pair(self): + response = [] # [{"result": result, "error": error}, ...] + + def callback(result, error): + response.append({"result": result, "error": error}) + + return callback, response + + def _assert_streaming_response_is_cancelled(self, response): + self.assertGreater(len(response), 0) + cancelled_count = 0 + for res in response: + result, error = res["result"], res["error"] + if error: + self.assertEqual(result, None) + self.assertIsInstance(error, InferenceServerException) + if error.status() == "StatusCode.CANCELLED": + cancelled_count += 1 + self.assertEqual(cancelled_count, 1) + + # Test queued requests on dynamic batch scheduler can be cancelled + def test_dynamic_batch_scheduler_request_cancellation(self): + model_name = "dynamic_batch" + with concurrent.futures.ThreadPoolExecutor() as pool: + # Saturate the 2 batch slots on the model of 1 instance + saturate_thread_1 = pool.submit( + self._triton.infer, model_name, self._get_inputs(batch_size=1) + ) + saturate_thread_2 = pool.submit( + self._triton.infer, model_name, self._get_inputs(batch_size=1) + ) + time.sleep(2) # ensure the slots are filled + # The next request should be queued + callback, response = self._generate_callback_and_response_pair() + queue_future = self._triton.async_infer( + model_name, self._get_inputs(batch_size=1), callback + ) + time.sleep(2) # ensure the request is queued + self.assertFalse(response["responded"]) + # Cancel the queued request + queue_future.cancel() + time.sleep(2) # ensure the cancellation is delivered + self._assert_response_is_cancelled(response) + # Join saturating thread + saturate_thread_1.result() + saturate_thread_2.result() + + # Test backlogged requests on sequence batch scheduler can be cancelled + def test_sequence_batch_scheduler_backlog_request_cancellation(self): + model_name = "sequence_direct" + with concurrent.futures.ThreadPoolExecutor() as pool: + # Saturate the single sequence slot + saturate_thread = pool.submit( + self._triton.infer, + model_name, + self._get_inputs(batch_size=1), + sequence_id=1, + sequence_start=True, + ) + time.sleep(2) # ensure the slot is filled + # The next sequence with 2 requests should be on the backlog + backlog_requests = [] + for i in range(2): + callback, response = self._generate_callback_and_response_pair() + backlog_future = 
self._triton.async_infer( + model_name, + self._get_inputs(batch_size=1), + callback, + sequence_id=2, + sequence_start=(True if i == 0 else False), + ) + backlog_requests.append( + {"future": backlog_future, "response": response} + ) + time.sleep(2) # ensure the sequence is backlogged + self.assertFalse(backlog_requests[0]["response"]["responded"]) + self.assertFalse(backlog_requests[1]["response"]["responded"]) + # Cancelling any backlogged request cancels the entire sequence + backlog_requests[0]["future"].cancel() + time.sleep(2) # ensure the cancellation is delivered + time.sleep(2) # ensure reaper thread has responded + self._assert_response_is_cancelled(backlog_requests[0]["response"]) + self._assert_response_is_cancelled(backlog_requests[1]["response"]) + # Join saturating thread + saturate_thread.result() + + # Test queued requests on direct sequence batch scheduler can be cancelled + def test_direct_sequence_batch_scheduler_request_cancellation(self): + model_name = "sequence_direct" + self._test_sequence_batch_scheduler_queued_request_cancellation(model_name) + + # Test queued requests on oldest sequence batch scheduler can be cancelled + def test_oldest_sequence_batch_scheduler_request_cancellation(self): + model_name = "sequence_oldest" + self._test_sequence_batch_scheduler_queued_request_cancellation(model_name) + + # Helper function + def _test_sequence_batch_scheduler_queued_request_cancellation(self, model_name): + with concurrent.futures.ThreadPoolExecutor() as pool: + # Start the sequence + start_thread = pool.submit( + self._triton.infer, + model_name, + self._get_inputs(batch_size=1), + sequence_id=1, + sequence_start=True, + ) + time.sleep(2) # ensure the sequence has started + # The next 2 requests should be queued + queue_requests = [] + for i in range(2): + callback, response = self._generate_callback_and_response_pair() + queue_future = self._triton.async_infer( + model_name, self._get_inputs(batch_size=1), callback, sequence_id=1 + ) + queue_requests.append({"future": queue_future, "response": response}) + time.sleep(2) # ensure the requests are queued + self.assertFalse(queue_requests[0]["response"]["responded"]) + self.assertFalse(queue_requests[1]["response"]["responded"]) + # Cancelling any queued request cancels the entire sequence + queue_requests[0]["future"].cancel() + time.sleep(2) # ensure the cancellation is delivered + time.sleep(2) # ensure reaper thread has responded + self._assert_response_is_cancelled(queue_requests[0]["response"]) + self._assert_response_is_cancelled(queue_requests[1]["response"]) + # Join start thread + start_thread.result() + + # Test ensemble scheduler will propagate cancellation request to child + def test_ensemble_scheduler_request_cancellation(self): + model_name = "ensemble_model" + callback, response = self._generate_callback_and_response_pair() + infer_future = self._triton.async_infer( + model_name, self._get_inputs(batch_size=1), callback + ) + time.sleep(2) # ensure the inference has started + self.assertFalse(response["responded"]) + infer_future.cancel() + time.sleep(2) # ensure the cancellation is delivered + self._assert_response_is_cancelled(response) + + # Test cancellation on multiple gRPC streaming sequences + def test_scheduler_streaming_request_cancellation(self): + model_name = "sequence_oldest" + # Start 2 sequences with many requests + callback, response = self._generate_streaming_callback_and_response_pair() + self._triton.start_stream(callback) + for sequence_id in [1, 2]: + sequence_start = True + for 
request_id in range(16): + self._triton.async_stream_infer( + model_name, + self._get_inputs(batch_size=1), + sequence_id=sequence_id, + sequence_start=sequence_start, + ) + sequence_start = False + time.sleep(2) # ensure the requests are delivered + # Cancelling the stream cancels all requests on the stream + self._triton.stop_stream(cancel_requests=True) + time.sleep(2) # ensure the cancellation is delivered + time.sleep(2) # ensure reaper thread has responded + self._assert_streaming_response_is_cancelled(response) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_request_cancellation/test.sh b/qa/L0_request_cancellation/test.sh new file mode 100755 index 0000000000..4929be3a5f --- /dev/null +++ b/qa/L0_request_cancellation/test.sh @@ -0,0 +1,183 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +export CUDA_VISIBLE_DEVICES=0 + +SERVER=/opt/tritonserver/bin/tritonserver +source ../common/util.sh + +RET=0 + +# +# Unit tests +# +rm -rf models && mkdir models +mkdir -p models/model/1 && (cd models/model && \ + echo 'name: "model"' >> config.pbtxt && \ + echo 'backend: "identity"' >> config.pbtxt && \ + echo 'max_batch_size: 64' >> config.pbtxt && \ + echo -e 'input [{ name: "INPUT0" \n data_type: TYPE_INT32 \n dims: [ 1000 ] }]' >> config.pbtxt && \ + echo -e 'output [{ name: "OUTPUT0" \n data_type: TYPE_INT32 \n dims: [ 1000 ] }]' >> config.pbtxt && \ + echo 'instance_group [{ kind: KIND_CPU }]' >> config.pbtxt) + +SERVER_LOG=server.log +LD_LIBRARY_PATH=/opt/tritonserver/lib:$LD_LIBRARY_PATH ./request_cancellation_test > $SERVER_LOG +if [ $? 
-ne 0 ]; then + echo -e "\n***\n*** Unit Tests Failed\n***" + cat $SERVER_LOG + RET=1 +fi + +# +# gRPC cancellation tests +# +rm -rf models && mkdir models +mkdir -p models/custom_identity_int32/1 && (cd models/custom_identity_int32 && \ + echo 'name: "custom_identity_int32"' >> config.pbtxt && \ + echo 'backend: "identity"' >> config.pbtxt && \ + echo 'max_batch_size: 1024' >> config.pbtxt && \ + echo -e 'input [{ name: "INPUT0" \n data_type: TYPE_INT32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'output [{ name: "OUTPUT0" \n data_type: TYPE_INT32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo 'instance_group [{ kind: KIND_CPU }]' >> config.pbtxt && \ + echo -e 'parameters [{ key: "execute_delay_ms" \n value: { string_value: "10000" } }]' >> config.pbtxt) + +for TEST_CASE in "test_grpc_async_infer" "test_grpc_stream_infer" "test_aio_grpc_async_infer" "test_aio_grpc_stream_infer"; do + + TEST_LOG="./grpc_cancellation_test.$TEST_CASE.log" + SERVER_LOG="grpc_cancellation_test.$TEST_CASE.server.log" + + SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1" + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + set +e + python grpc_cancellation_test.py GrpcCancellationTest.$TEST_CASE > $TEST_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** gRPC Cancellation Tests Failed on $TEST_CASE\n***" + cat $TEST_LOG + RET=1 + fi + grep "Cancellation notification received for" $SERVER_LOG + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Cancellation not received by server on $TEST_CASE\n***" + cat $SERVER_LOG + RET=1 + fi + set -e + + kill $SERVER_PID + wait $SERVER_PID +done + +# +# End-to-end scheduler tests +# +rm -rf models && mkdir models +mkdir -p models/dynamic_batch/1 && (cd models/dynamic_batch && \ + echo 'name: "dynamic_batch"' >> config.pbtxt && \ + echo 'backend: "identity"' >> config.pbtxt && \ + echo 'max_batch_size: 2' >> config.pbtxt && \ + echo -e 'input [{ name: "INPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'output [{ name: "OUTPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'instance_group [{ count: 1 \n kind: KIND_CPU }]' >> config.pbtxt && \ + echo -e 'dynamic_batching { max_queue_delay_microseconds: 600000 }' >> config.pbtxt && \ + echo -e 'parameters [{ key: "execute_delay_ms" \n value: { string_value: "6000" } }]' >> config.pbtxt) +mkdir -p models/sequence_direct/1 && (cd models/sequence_direct && \ + echo 'name: "sequence_direct"' >> config.pbtxt && \ + echo 'backend: "identity"' >> config.pbtxt && \ + echo 'max_batch_size: 1' >> config.pbtxt && \ + echo -e 'input [{ name: "INPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'output [{ name: "OUTPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'instance_group [{ count: 1 \n kind: KIND_CPU }]' >> config.pbtxt && \ + echo -e 'sequence_batching { direct { } \n max_sequence_idle_microseconds: 6000000 }' >> config.pbtxt && \ + echo -e 'parameters [{ key: "execute_delay_ms" \n value: { string_value: "6000" } }]' >> config.pbtxt) +mkdir -p models/sequence_oldest/1 && (cd models/sequence_oldest && \ + echo 'name: "sequence_oldest"' >> config.pbtxt && \ + echo 'backend: "identity"' >> config.pbtxt && \ + echo 'max_batch_size: 1' >> config.pbtxt && \ + echo -e 'input [{ name: "INPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'output [{ name: "OUTPUT0" \n data_type: TYPE_FP32 
\n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'instance_group [{ count: 1 \n kind: KIND_CPU }]' >> config.pbtxt && \ + echo -e 'sequence_batching { oldest { max_candidate_sequences: 1 } \n max_sequence_idle_microseconds: 6000000 }' >> config.pbtxt && \ + echo -e 'parameters [{ key: "execute_delay_ms" \n value: { string_value: "6000" } }]' >> config.pbtxt) +mkdir -p models/ensemble_model/1 && (cd models/ensemble_model && \ + echo 'name: "ensemble_model"' >> config.pbtxt && \ + echo 'platform: "ensemble"' >> config.pbtxt && \ + echo 'max_batch_size: 1' >> config.pbtxt && \ + echo -e 'input [{ name: "INPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'output [{ name: "OUTPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo 'ensemble_scheduling { step [' >> config.pbtxt && \ + echo -e '{ model_name: "dynamic_batch" \n model_version: -1 \n input_map { key: "INPUT0" \n value: "INPUT0" } \n output_map { key: "OUTPUT0" \n value: "out" } },' >> config.pbtxt && \ + echo -e '{ model_name: "dynamic_batch" \n model_version: -1 \n input_map { key: "INPUT0" \n value: "out" } \n output_map { key: "OUTPUT0" \n value: "OUTPUT0" } }' >> config.pbtxt && \ + echo '] }' >> config.pbtxt) + +TEST_LOG="scheduler_test.log" +SERVER_LOG="./scheduler_test.server.log" + +SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=2" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python scheduler_test.py > $TEST_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Scheduler Tests Failed\n***" + cat $TEST_LOG + RET=1 +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi +exit $RET diff --git a/qa/L0_response_cache/models/decoupled_cache/config.pbtxt b/qa/L0_response_cache/models/decoupled_cache/config.pbtxt new file mode 100644 index 0000000000..c243e72861 --- /dev/null +++ b/qa/L0_response_cache/models/decoupled_cache/config.pbtxt @@ -0,0 +1,49 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "identity" +max_batch_size: 0 +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +model_transaction_policy { + decoupled: True +} +response_cache { + enable: True +} diff --git a/qa/L0_response_cache/models/identity_cache/config.pbtxt b/qa/L0_response_cache/models/identity_cache/config.pbtxt new file mode 100644 index 0000000000..7ba5cf2afb --- /dev/null +++ b/qa/L0_response_cache/models/identity_cache/config.pbtxt @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "identity" +max_batch_size: 0 +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +response_cache { + enable: True +} diff --git a/qa/L0_response_cache/test.sh b/qa/L0_response_cache/test.sh index c13858226d..434195b693 100755 --- a/qa/L0_response_cache/test.sh +++ b/qa/L0_response_cache/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -29,19 +29,254 @@ RET=0 TEST_LOG="./response_cache_test.log" UNIT_TEST=./response_cache_test +export CUDA_VISIBLE_DEVICES=0 + +# Only localhost supported in this test for now, but in future could make +# use of a persistent remote redis server, or similarly use --replicaof arg. +export TRITON_REDIS_HOST="localhost" +export TRITON_REDIS_PORT="6379" +REDIS_LOG="./redis-server.unit_tests.log" rm -fr *.log +function install_redis() { + ## Install redis if not already installed + if ! command -v redis-server >/dev/null 2>&1; then + apt update -y && apt install -y redis + fi +} + +function start_redis() { + # Run redis server in background + redis-server \ + --daemonize yes \ + --port "${TRITON_REDIS_PORT}" \ + --logfile "${REDIS_LOG}" \ + --loglevel debug + + # Check redis server is running + REDIS_PING_RESPONSE=$(redis-cli -h ${TRITON_REDIS_HOST} -p ${TRITON_REDIS_PORT} ping) + if [ "${REDIS_PING_RESPONSE}" == "PONG" ]; then + echo "Redis successfully started in background" + else + echo -e "\n***\n*** Failed: Redis server did not start successfully\n***" + RET=1 + fi +} + +function stop_redis() { + echo "Stopping Redis server..." + redis-cli -h "${TRITON_REDIS_HOST}" -p "${TRITON_REDIS_PORT}" shutdown || true + echo "Redis server shutdown" +} + +function set_redis_auth() { + # NOTE: Per-user auth [Access Control List (ACL)] is only supported in + # Redis >= 6.0 and is more comprehensive in what can be configured. + # For simplicity and wider range of Redis version support, use + # server-wide password via "requirepass" for now. + redis-cli -h "${TRITON_REDIS_HOST}" -p "${TRITON_REDIS_PORT}" config set requirepass "${REDIS_PW}" + export REDISCLI_AUTH="${REDIS_PW}" +} + +function unset_redis_auth() { + # Authenticate implicitly via REDISCLI_AUTH env var, then unset password/var + redis-cli -h "${TRITON_REDIS_HOST}" -p "${TRITON_REDIS_PORT}" config set requirepass "" + unset REDISCLI_AUTH +} + +# UNIT TESTS set +e -export CUDA_VISIBLE_DEVICES=0 + +## Unit tests currently run for both Local and Redis cache implementations +## by default. However, we could break out the unit tests for each +## into separate runs gtest filters if needed in the future: +## - `${UNIT_TEST} --gtest_filter=*Local*` +## - `${UNIT_TEST} --gtest_filter=*Redis*` +install_redis +# Stop any existing redis server first for good measure +stop_redis +start_redis LD_LIBRARY_PATH=/opt/tritonserver/lib:$LD_LIBRARY_PATH $UNIT_TEST >>$TEST_LOG 2>&1 if [ $? -ne 0 ]; then cat $TEST_LOG echo -e "\n***\n*** Response Cache Unit Test Failed\n***" RET=1 fi +stop_redis set -e +# SERVER TESTS +function check_server_success_and_kill { + if [ "${SERVER_PID}" == "0" ]; then + echo -e "\n***\n*** Failed to start ${SERVER}\n***" + cat ${SERVER_LOG} + RET=1 + else + kill ${SERVER_PID} + wait ${SERVER_PID} + fi +} + +function check_server_expected_failure { + EXPECTED_MESSAGE="${1}" + if [ "${SERVER_PID}" != "0" ]; then + echo -e "\n***\n*** Failed: ${SERVER} started successfully when it was expected to fail\n***" + cat ${SERVER_LOG} + RET=1 + + kill ${SERVER_PID} + wait ${SERVER_PID} + else + # Check that server fails with the correct error message + set +e + grep -i "${EXPECTED_MESSAGE}" ${SERVER_LOG} + if [ $? 
-ne 0 ]; then + echo -e "\n***\n*** Failed: Expected [${EXPECTED_MESSAGE}] error message in output\n***" + cat $SERVER_LOG + RET=1 + fi + set -e + fi +} + +MODEL_DIR="${PWD}/models" +mkdir -p "${MODEL_DIR}/decoupled_cache/1" +mkdir -p "${MODEL_DIR}/identity_cache/1" + +# Check that server fails to start for a "decoupled" model with cache enabled +EXTRA_ARGS="--model-control-mode=explicit --load-model=decoupled_cache" + +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_ARGS="--model-repository=${MODEL_DIR} --response-cache-byte-size=8192 ${EXTRA_ARGS}" +SERVER_LOG="./inference_server.log" +source ../common/util.sh +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "\n***\n*** Failed: $SERVER started successfully when it was expected to fail\n***" + cat $SERVER_LOG + RET=1 + + kill $SERVER_PID + wait $SERVER_PID +else + # Check that server fails with the correct error message + set +e + grep -i "response cache does not currently support" ${SERVER_LOG} | grep -i "decoupled" + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed: Expected response cache / decoupled mode error message in output\n***" + cat $SERVER_LOG + RET=1 + fi + set -e +fi + +# Test with model expected to load successfully +EXTRA_ARGS="--model-control-mode=explicit --load-model=identity_cache" + +# Test old cache config method +# --response-cache-byte-size must be non-zero to test models with cache enabled +SERVER_ARGS="--model-repository=${MODEL_DIR} --response-cache-byte-size=8192 ${EXTRA_ARGS}" +run_server +check_server_success_and_kill + +# Test new cache config method +SERVER_ARGS="--model-repository=${MODEL_DIR} --cache-config=local,size=8192 ${EXTRA_ARGS}" +run_server +check_server_success_and_kill + +# Test that specifying multiple cache types is not supported and should fail +SERVER_ARGS="--model-repository=${MODEL_DIR} --cache-config=local,size=8192 --cache-config=redis,key=value ${EXTRA_ARGS}" +run_server +check_server_expected_failure "multiple cache configurations" + +# Test that specifying both config styles is incompatible and should fail +SERVER_ARGS="--model-repository=${MODEL_DIR} --response-cache-byte-size=12345 --cache-config=local,size=67890 ${EXTRA_ARGS}" +run_server +check_server_expected_failure "incompatible flags" + +## Redis Cache CLI tests +REDIS_ENDPOINT="--cache-config redis,host=${TRITON_REDIS_HOST} --cache-config redis,port=${TRITON_REDIS_PORT}" +REDIS_LOG="./redis-server.cli_tests.log" +start_redis + +# Test simple redis cache config succeeds +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${EXTRA_ARGS}" +run_server +check_server_success_and_kill + +# Test triton fails to initialize if it can't connect to redis cache +SERVER_ARGS="--model-repository=${MODEL_DIR} --cache-config=redis,host=localhost --cache-config=redis,port=nonexistent ${EXTRA_ARGS}" +run_server +check_server_expected_failure "Failed to connect to Redis: Connection refused" + +# Test triton fails to initialize if it can't resolve host for redis cache +SERVER_ARGS="--model-repository=${MODEL_DIR} --cache-config=redis,host=nonexistent --cache-config=redis,port=nonexistent ${EXTRA_ARGS}" +run_server +# Either of these errors can be returned for bad hostname, so check for either. 
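These Redis-backed cache cases assume a local redis-server reachable at TRITON_REDIS_HOST:TRITON_REDIS_PORT; the start_redis helper above verifies that with redis-cli ping. Purely as an illustration (the test itself does not use it), the same health check could be written in Python with the redis-py package, which is an assumption here rather than a dependency of this test:

    import os

    import redis  # redis-py; assumed available for this sketch only

    host = os.getenv("TRITON_REDIS_HOST", "localhost")
    port = int(os.getenv("TRITON_REDIS_PORT", "6379"))

    # Equivalent of: redis-cli -h $TRITON_REDIS_HOST -p $TRITON_REDIS_PORT ping
    if not redis.Redis(host=host, port=port).ping():
        raise RuntimeError("Redis server did not respond to PING")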
+MSG1="Temporary failure in name resolution" +MSG2="Name or service not known" +check_server_expected_failure "${MSG1}\|${MSG2}" + +# Test triton fails to initialize if minimum required args (host & port) not all provided +SERVER_ARGS="--model-repository=${MODEL_DIR} --cache-config=redis,port=${TRITON_REDIS_HOST} ${EXTRA_ARGS}" +run_server +check_server_expected_failure "Must at a minimum specify" + +## Redis Authentication tests + +# Automatically provide auth via REDISCLI_AUTH env var when set: https://redis.io/docs/ui/cli/ +REDIS_PW="redis123!" +set_redis_auth + +### Credentials via command-line + +# Test simple redis authentication succeeds with correct credentials +REDIS_CACHE_AUTH="--cache-config redis,password=${REDIS_PW}" +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${REDIS_CACHE_AUTH} ${EXTRA_ARGS}" +run_server +check_server_success_and_kill + +# Test simple redis authentication fails with wrong credentials +REDIS_CACHE_AUTH="--cache-config redis,password=wrong" +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${REDIS_CACHE_AUTH} ${EXTRA_ARGS}" +run_server +check_server_expected_failure "WRONGPASS" + +# Test simple redis authentication fails with no credentials +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${EXTRA_ARGS}" +run_server +check_server_expected_failure "NOAUTH Authentication required" + +### Credentials via environment variables + +# Test simple redis authentication succeeds with password-only via env vars +# No username means use "default" as the username +unset TRITONCACHE_REDIS_USERNAME +export TRITONCACHE_REDIS_PASSWORD="${REDIS_PW}" +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${EXTRA_ARGS}" +run_server +check_server_success_and_kill + +# Test simple redis authentication succeeds with correct user and password via env vars +export TRITONCACHE_REDIS_USERNAME="default" +export TRITONCACHE_REDIS_PASSWORD="${REDIS_PW}" +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${EXTRA_ARGS}" +run_server +check_server_success_and_kill + +# Test simple redis authentication fails with wrong credentials via env vars +export TRITONCACHE_REDIS_PASSWORD="wrong" +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${EXTRA_ARGS}" +run_server +check_server_expected_failure "WRONGPASS" +unset TRITONCACHE_REDIS_USERNAME +unset TRITONCACHE_REDIS_PASSWORD + +# Clean up redis server before exiting test +unset_redis_auth +stop_redis + if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else diff --git a/qa/L0_sagemaker/sagemaker_multi_model_test.py b/qa/L0_sagemaker/sagemaker_multi_model_test.py old mode 100644 new mode 100755 index 820562c1da..b2052f6751 --- a/qa/L0_sagemaker/sagemaker_multi_model_test.py +++ b/qa/L0_sagemaker/sagemaker_multi_model_test.py @@ -1,5 +1,5 @@ #!/usr/bin/python -# Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -29,27 +29,20 @@ sys.path.append("../common") +import json import os -import shutil +import sys import time import unittest + import numpy as np -import infer_util as iu +import requests import test_util as tu import tritonclient.http as httpclient -import argparse -import csv -import json -import os -import requests -import socket -import sys - class SageMakerMultiModelTest(tu.TestResultCollector): def setUp(self): - SAGEMAKER_BIND_TO_PORT = os.getenv("SAGEMAKER_BIND_TO_PORT", "8080") self.url_mme_ = "http://localhost:{}/models".format(SAGEMAKER_BIND_TO_PORT) @@ -58,15 +51,59 @@ def setUp(self): self.model1_url = "/opt/ml/models/123456789abcdefghi/model" self.model1_input_data_ = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] - self.model1_expected_output0_data_ = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30] - self.model1_expected_output1_data_ = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] + self.model1_expected_output0_data_ = [ + 0, + 2, + 4, + 6, + 8, + 10, + 12, + 14, + 16, + 18, + 20, + 22, + 24, + 26, + 28, + 30, + ] + self.model1_expected_output1_data_ = [ + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + ] self.model1_expected_result_ = { "model_name": "sm_mme_model_1", "model_version": "1", "outputs": [ - {"name": "OUTPUT0", "datatype": "INT32", "shape": [1, 16], "data": self.model1_expected_output0_data_}, - {"name": "OUTPUT1", "datatype": "INT32", "shape": [1, 16], "data": self.model1_expected_output1_data_}, + { + "name": "OUTPUT0", + "datatype": "INT32", + "shape": [1, 16], + "data": self.model1_expected_output0_data_, + }, + { + "name": "OUTPUT1", + "datatype": "INT32", + "shape": [1, 16], + "data": self.model1_expected_output1_data_, + }, ], } @@ -77,9 +114,15 @@ def setUp(self): # Output is same as input since this is an identity model self.model2_input_data_ = [0, 1, 2, 3, 4, 5, 6, 7] + # ensemble model setup + self.model3_name = "123456789ensemble" + self.model3_url = "/opt/ml/models/123456789ensemble/model" + def test_sm_0_environment_variables_set(self): self.assertEqual( - os.getenv("SAGEMAKER_MULTI_MODEL"), "true", "Variable SAGEMAKER_MULTI_MODEL must be set to true" + os.getenv("SAGEMAKER_MULTI_MODEL"), + "true", + "Variable SAGEMAKER_MULTI_MODEL must be set to true", ) def test_sm_1_model_load(self): @@ -88,35 +131,59 @@ def test_sm_1_model_load(self): headers = {"Content-Type": "application/json"} r = requests.post(self.url_mme_, data=json.dumps(request_body), headers=headers) time.sleep(5) # wait for model to load - self.assertEqual(r.status_code, 200, "Expected status code 200, received {}".format(r.status_code)) + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) # Load the same model again, expect a 409 request_body = {"model_name": self.model1_name, "url": self.model1_url} headers = {"Content-Type": "application/json"} r = requests.post(self.url_mme_, data=json.dumps(request_body), headers=headers) time.sleep(5) # wait for model to load - self.assertEqual(r.status_code, 409, "Expected status code 409, received {}".format(r.status_code)) + self.assertEqual( + r.status_code, + 409, + "Expected status code 409, received {}".format(r.status_code), + ) # Load model_2 request_body = {"model_name": self.model2_name, "url": self.model2_url} headers = {"Content-Type": "application/json"} r = 
requests.post(self.url_mme_, data=json.dumps(request_body), headers=headers) time.sleep(5) # wait for model to load - self.assertEqual(r.status_code, 200, "Expected status code 200, received {}".format(r.status_code)) + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) def test_sm_2_model_list(self): r = requests.get(self.url_mme_) time.sleep(3) expected_response_1 = { "models": [ - {"modelName": self.model1_name, "modelUrl": self.model1_url}, - {"modelName": self.model2_name, "modelUrl": self.model2_url}, + { + "modelName": self.model1_name, + "modelUrl": self.model1_url.rstrip("/model"), + }, + { + "modelName": self.model2_name, + "modelUrl": self.model2_url.rstrip("/model"), + }, ] } expected_response_2 = { "models": [ - {"modelName": self.model2_name, "modelUrl": self.model2_url}, - {"modelName": self.model1_name, "modelUrl": self.model1_url}, + { + "modelName": self.model2_name, + "modelUrl": self.model2_url.rstrip("/model"), + }, + { + "modelName": self.model1_name, + "modelUrl": self.model1_url.rstrip("/model"), + }, ] } @@ -124,16 +191,23 @@ def test_sm_2_model_list(self): self.assertIn( r.json(), [expected_response_1, expected_response_2], - "Expected one of {}, received: {}".format([expected_response_1, expected_response_2], r.json()), + "Expected one of {}, received: {}".format( + [expected_response_1, expected_response_2], r.json() + ), ) def test_sm_3_model_get(self): get_url = "{}/{}".format(self.url_mme_, self.model1_name) r = requests.get(get_url) time.sleep(3) - expected_response = {"modelName": self.model1_name, "modelUrl": self.model1_url} + expected_response = { + "modelName": self.model1_name, + "modelUrl": self.model1_url.rstrip("/model"), + } self.assertEqual( - r.json(), expected_response, "Expected response: {}, received: {}".format(expected_response, r.json()) + r.json(), + expected_response, + "Expected response: {}, received: {}".format(expected_response, r.json()), ) def test_sm_4_model_invoke(self): @@ -151,7 +225,9 @@ def test_sm_4_model_invoke(self): outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) - request_body, _ = httpclient.InferenceServerClient.generate_request_body(inputs, outputs=outputs) + request_body, _ = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = {"Content-Type": "application/json"} invoke_url = "{}/{}/invoke".format(self.url_mme_, self.model1_name) @@ -161,7 +237,9 @@ def test_sm_4_model_invoke(self): self.assertEqual( self.model1_expected_result_, r.json(), - "Expected response : {}, received: {}".format(self.model1_expected_result_, r.json()), + "Expected response : {}, received: {}".format( + self.model1_expected_result_, r.json() + ), ) # Invoke model_2 @@ -180,42 +258,121 @@ def test_sm_4_model_invoke(self): outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body(inputs, outputs=outputs) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) invoke_url = "{}/{}/invoke".format(self.url_mme_, self.model2_name) headers = { - "Content-Type": "application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(header_length) + "Content-Type": "application/vnd.sagemaker-triton.binary+json;json-header-size={}".format( + header_length + ) } r = 
requests.post(invoke_url, data=request_body, headers=headers) - header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size=" + header_length_prefix = ( + "application/vnd.sagemaker-triton.binary+json;json-header-size=" + ) header_length_str = r.headers["Content-Type"][len(header_length_prefix) :] - result = httpclient.InferenceServerClient.parse_response_body(r._content, header_length=int(header_length_str)) + result = httpclient.InferenceServerClient.parse_response_body( + r._content, header_length=int(header_length_str) + ) # Get the inference header size so we can locate the output binary data output_data = result.as_numpy("OUTPUT0") - + for i in range(8): - self.assertEqual(output_data[0][i], input_data[0][i], "Tensor Value Mismatch") + self.assertEqual( + output_data[0][i], input_data[0][i], "Tensor Value Mismatch" + ) def test_sm_5_model_unload(self): # Unload model_1 unload_url = "{}/{}".format(self.url_mme_, self.model1_name) r = requests.delete(unload_url) time.sleep(3) - self.assertEqual(r.status_code, 200, "Expected status code 200, received {}".format(r.status_code)) + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) # Unload model_2 unload_url = "{}/{}".format(self.url_mme_, self.model2_name) r = requests.delete(unload_url) time.sleep(3) - self.assertEqual(r.status_code, 200, "Expected status code 200, received {}".format(r.status_code)) + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) # Unload a non-loaded model, expect a 404 unload_url = "{}/sm_non_loaded_model".format(self.url_mme_) r = requests.delete(unload_url) time.sleep(3) - self.assertEqual(r.status_code, 404, "Expected status code 404, received {}".format(r.status_code)) + self.assertEqual( + r.status_code, + 404, + "Expected status code 404, received {}".format(r.status_code), + ) + + def test_sm_6_ensemble_model(self): + # Load ensemble model + request_body = {"model_name": self.model3_name, "url": self.model3_url} + headers = { + "Content-Type": "application/json", + "X-Amzn-SageMaker-Target-Model": f"{self.model3_name}", + } + r = requests.post(self.url_mme_, data=json.dumps(request_body), headers=headers) + time.sleep(5) # wait for model to load + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) + + # Invoke ensemble model + inputs = [] + outputs = [] + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "FP32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "FP32")) + + # Initialize the data + input_data = np.array(self.model1_input_data_, dtype=np.float32) + input_data = np.expand_dims(input_data, axis=0) + inputs[0].set_data_from_numpy(input_data, binary_data=False) + inputs[1].set_data_from_numpy(input_data, binary_data=False) + + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + request_body, _ = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) + + headers = {"Content-Type": "application/json"} + invoke_url = "{}/{}/invoke".format(self.url_mme_, self.model3_name) + r = requests.post(invoke_url, data=request_body, headers=headers) + print(f"response: {r.text}") + r.raise_for_status() + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) + + # Unload ensemble model + unload_url = 
"{}/{}".format(self.url_mme_, self.model3_name) + r = requests.delete(unload_url, headers=headers) + time.sleep(5) + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) if __name__ == "__main__": diff --git a/qa/L0_sagemaker/sagemaker_test.py b/qa/L0_sagemaker/sagemaker_test.py old mode 100644 new mode 100755 index baff8b5528..6e76a9f0fd --- a/qa/L0_sagemaker/sagemaker_test.py +++ b/qa/L0_sagemaker/sagemaker_test.py @@ -1,5 +1,5 @@ #!/usr/bin/python -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,88 +26,98 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") +import json import os -import shutil -import time +import sys import unittest + import numpy as np -import infer_util as iu +import requests import test_util as tu import tritonclient.http as httpclient -import argparse -import csv -import json -import os -import requests -import socket -import sys - class SageMakerTest(tu.TestResultCollector): - def setUp(self): - SAGEMAKER_BIND_TO_PORT = os.getenv('SAGEMAKER_BIND_TO_PORT', '8080') - self.url_ = "http://localhost:{}/invocations".format( - SAGEMAKER_BIND_TO_PORT) - self.input_data_ = [ - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 - ] + SAGEMAKER_BIND_TO_PORT = os.getenv("SAGEMAKER_BIND_TO_PORT", "8080") + self.url_ = "http://localhost:{}/invocations".format(SAGEMAKER_BIND_TO_PORT) + self.input_data_ = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] self.expected_output0_data_ = [ - 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30 - ] - self.expected_output1_data_ = [ - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 + 0, + 2, + 4, + 6, + 8, + 10, + 12, + 14, + 16, + 18, + 20, + 22, + 24, + 26, + 28, + 30, ] + self.expected_output1_data_ = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] self.expected_result_ = { - "model_name": - "sm_model", - "model_version": - "1", - "outputs": [{ - "name": "OUTPUT0", - "datatype": "INT32", - "shape": [1, 16], - "data": self.expected_output0_data_ - }, { - "name": "OUTPUT1", - "datatype": "INT32", - "shape": [1, 16], - "data": self.expected_output1_data_ - }] + "model_name": "sm_model", + "model_version": "1", + "outputs": [ + { + "name": "OUTPUT0", + "datatype": "INT32", + "shape": [1, 16], + "data": self.expected_output0_data_, + }, + { + "name": "OUTPUT1", + "datatype": "INT32", + "shape": [1, 16], + "data": self.expected_output1_data_, + }, + ], } def test_direct_inference(self): request = { - "inputs": [{ - "name": "INPUT0", - "datatype": "INT32", - "shape": [1, 16], - "data": self.input_data_ - }, { - "name": "INPUT1", - "datatype": "INT32", - "shape": [1, 16], - "data": self.input_data_ - }] + "inputs": [ + { + "name": "INPUT0", + "datatype": "INT32", + "shape": [1, 16], + "data": self.input_data_, + }, + { + "name": "INPUT1", + "datatype": "INT32", + "shape": [1, 16], + "data": self.input_data_, + }, + ] } - headers = {'Content-Type': 'application/json'} + headers = {"Content-Type": "application/json"} r = requests.post(self.url_, data=json.dumps(request), headers=headers) r.raise_for_status() self.assertEqual( - self.expected_result_, r.json(), + self.expected_result_, + r.json(), "Expected response body: {}; got: {}".format( - self.expected_result_, 
r.json())) + self.expected_result_, r.json() + ), + ) def test_inference_client_generated_request(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -115,27 +125,29 @@ def test_inference_client_generated_request(self): inputs[0].set_data_from_numpy(input_data, binary_data=False) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) request_body, _ = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + inputs, outputs=outputs + ) - headers = {'Content-Type': 'application/json'} + headers = {"Content-Type": "application/json"} r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() self.assertEqual( - self.expected_result_, r.json(), + self.expected_result_, + r.json(), "Expected response body: {}; got: {}".format( - self.expected_result_, r.json())) + self.expected_result_, r.json() + ), + ) def test_inference_client_generated_request_binary(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -143,31 +155,36 @@ def test_inference_client_generated_request_binary(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.sagemaker-triton.binary+json;json-header-size={}' - .format(header_length) + "Content-Type": "application/vnd.sagemaker-triton.binary+json;json-header-size={}".format( + header_length + ) } r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() self.assertEqual( - self.expected_result_, r.json(), + self.expected_result_, + r.json(), "Expected response body: {}; got: {}".format( - self.expected_result_, r.json())) + self.expected_result_, r.json() + ), + ) def test_inference_client_generated_response(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + 
inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -175,22 +192,20 @@ def test_inference_client_generated_response(self): inputs[0].set_data_from_numpy(input_data, binary_data=False) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) request_body, _ = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + inputs, outputs=outputs + ) - headers = {'Content-Type': 'application/json'} + headers = {"Content-Type": "application/json"} r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() - result = httpclient.InferenceServerClient.parse_response_body( - r._content) + result = httpclient.InferenceServerClient.parse_response_body(r._content) - output0_data = result.as_numpy('OUTPUT0') - output1_data = result.as_numpy('OUTPUT1') + output0_data = result.as_numpy("OUTPUT0") + output1_data = result.as_numpy("OUTPUT1") for i in range(16): self.assertEqual(output0_data[0][i], self.expected_output0_data_[i]) self.assertEqual(output1_data[0][i], self.expected_output1_data_[i]) @@ -198,8 +213,8 @@ def test_inference_client_generated_response(self): def test_inference_client_generated_response_binary(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -207,25 +222,26 @@ def test_inference_client_generated_response_binary(self): inputs[0].set_data_from_numpy(input_data, binary_data=False) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) request_body, _ = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + inputs, outputs=outputs + ) - headers = {'Content-Type': 'application/json'} + headers = {"Content-Type": "application/json"} r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() - header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size=" - header_length_str = r.headers['Content-Type'][len(header_length_prefix - ):] + header_length_prefix = ( + "application/vnd.sagemaker-triton.binary+json;json-header-size=" + ) + header_length_str = r.headers["Content-Type"][len(header_length_prefix) :] result = httpclient.InferenceServerClient.parse_response_body( - r._content, header_length=int(header_length_str)) + r._content, header_length=int(header_length_str) + ) - output0_data = result.as_numpy('OUTPUT0') - output1_data = result.as_numpy('OUTPUT1') + output0_data = result.as_numpy("OUTPUT0") + output1_data = result.as_numpy("OUTPUT1") for i in range(16): 
self.assertEqual(output0_data[0][i], self.expected_output0_data_[i]) self.assertEqual(output1_data[0][i], self.expected_output1_data_[i]) @@ -233,8 +249,8 @@ def test_inference_client_generated_response_binary(self): def test_malformed_binary_header(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -242,29 +258,34 @@ def test_malformed_binary_header(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'additional-string/application/vnd.sagemaker-triton.binary+json;json-header-size={}' - .format(header_length) + "Content-Type": "additional-string/application/vnd.sagemaker-triton.binary+json;json-header-size={}".format( + header_length + ) } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) def test_malformed_binary_header_not_number(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -272,29 +293,34 @@ def test_malformed_binary_header_not_number(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.sagemaker-triton.binary+json;json-header-size=additional-string{}' - .format(header_length) + "Content-Type": "application/vnd.sagemaker-triton.binary+json;json-header-size=additional-string{}".format( + header_length + ) } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: 
{}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) def test_malformed_binary_header_negative_number(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -302,28 +328,32 @@ def test_malformed_binary_header_negative_number(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.sagemaker-triton.binary+json;json-header-size=-123' + "Content-Type": "application/vnd.sagemaker-triton.binary+json;json-header-size=-123" } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) def test_malformed_binary_header_large_number(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -331,23 +361,27 @@ def test_malformed_binary_header_large_number(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.sagemaker-triton.binary+json;json-header-size=12345' + "Content-Type": "application/vnd.sagemaker-triton.binary+json;json-header-size=12345" } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_sagemaker/test.sh b/qa/L0_sagemaker/test.sh index e701e8dd71..b5bd07c519 100755 --- a/qa/L0_sagemaker/test.sh +++ 
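# A minimal sketch of the convention exercised by the SageMaker tests above:
# the request's Content-Type declares how many leading bytes of the body form
# the JSON header ("application/vnd.sagemaker-triton.binary+json;
# json-header-size=<N>"), and the response is parsed back with the same
# offset. The helper names below are illustrative only; generate_request_body()
# and parse_response_body() are the tritonclient calls the tests themselves use.
import tritonclient.http as httpclient

_BINARY_JSON = "application/vnd.sagemaker-triton.binary+json;json-header-size="

def sagemaker_binary_headers(header_length):
    # Request side: announce the JSON header size returned by
    # httpclient.InferenceServerClient.generate_request_body().
    return {"Content-Type": _BINARY_JSON + str(header_length)}

def parse_sagemaker_binary_response(response):
    # Response side: recover the header size from Content-Type, then let the
    # Triton HTTP client split the JSON header from the trailing binary data.
    header_length = int(response.headers["Content-Type"][len(_BINARY_JSON):])
    return httpclient.InferenceServerClient.parse_response_body(
        response.content, header_length=header_length
    )

# The malformed-header tests above then verify that a Content-Type which does
# not match this exact prefix, or which encodes a negative or oversized header
# length, is rejected with HTTP 400.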
b/qa/L0_sagemaker/test.sh @@ -56,11 +56,12 @@ rm -f *.out SAGEMAKER_TEST=sagemaker_test.py SAGEMAKER_MULTI_MODEL_TEST=sagemaker_multi_model_test.py -MULTI_MODEL_UNIT_TEST_COUNT=6 +MULTI_MODEL_UNIT_TEST_COUNT=7 UNIT_TEST_COUNT=9 CLIENT_LOG="./client.log" DATADIR=/data/inferenceserver/${REPO_VERSION} +ENSEMBLEDIR=/data/inferenceserver/${REPO_VERSION}/qa_ensemble_model_repository/qa_model_repository SERVER=/opt/tritonserver/bin/tritonserver SERVER_LOG="./server.log" # Link model repository to "/opt/ml/model" @@ -352,7 +353,7 @@ if [ "$SERVER_PID" == "0" ]; then exit 1 fi -# Ping and expect error code +# Ping and expect error code in SME mode. set +e code=`curl -s -w %{http_code} -o ./ping.out localhost:8080/ping` set -e @@ -382,6 +383,33 @@ cp -r $DATADIR/qa_model_repository/onnx_int32_int32_int32/* ${MODEL1_PATH} && \ cp -r $DATADIR/qa_identity_model_repository/onnx_zero_1_float32/* ${MODEL2_PATH} && \ sed -i "s/onnx_zero_1_float32/sm_mme_model_2/" ${MODEL2_PATH}/config.pbtxt +# Ensemble model +ENSEMBLE_MODEL_PATH="models/123456789ensemble/model" +mkdir -p "${ENSEMBLE_MODEL_PATH}" + +model_name=python_float32_float32_float32 + +mkdir -p ${ENSEMBLE_MODEL_PATH}/${model_name}/1 && \ +cp ../python_models/add_sub/model.py ${ENSEMBLE_MODEL_PATH}/${model_name}/1/. && \ +cp ../python_models/add_sub/config.pbtxt ${ENSEMBLE_MODEL_PATH}/${model_name}/. +(cd ${ENSEMBLE_MODEL_PATH}/${model_name} && \ + sed -i "s/label_filename:.*//" config.pbtxt && \ + echo "max_batch_size: 64" >> config.pbtxt) + +# Ensemble part +mkdir -p ${ENSEMBLE_MODEL_PATH}/fan_${model_name}/1 && \ + cp ../python_models/add_sub/model.py ${ENSEMBLE_MODEL_PATH}/fan_${model_name}/1/. && \ + cp ../python_models/fan_add_sub/config.pbtxt ${ENSEMBLE_MODEL_PATH}/fan_${model_name}/. && \ + (cd ${ENSEMBLE_MODEL_PATH}/fan_${model_name} && \ + sed -i "s/label_filename:.*//" config.pbtxt && \ + sed -i "s/model_name: \"ENSEMBLE_MODEL_NAME\"/model_name: \"${model_name}\"/" config.pbtxt && \ + sed -i "0,/name:.*/{s/name:.*/name: \"fan_${model_name}\"/}" config.pbtxt && \ + echo "max_batch_size: 64" >> config.pbtxt) + +# # custom float32 component of ensemble +cp -r $ENSEMBLEDIR/nop_TYPE_FP32_-1 ${ENSEMBLE_MODEL_PATH}/. && \ + mkdir -p ${ENSEMBLE_MODEL_PATH}/nop_TYPE_FP32_-1/1 + # Start server with 'serve' script export SAGEMAKER_MULTI_MODEL=true export SAGEMAKER_TRITON_LOG_VERBOSE=true @@ -423,10 +451,8 @@ rm -rf /opt/ml/models kill $SERVER_PID wait $SERVE_PID - # MME end - unlink /opt/ml/model rm -rf /opt/ml/model diff --git a/qa/L0_savedmodel_shape/saved_model_shape_test.py b/qa/L0_savedmodel_shape/saved_model_shape_test.py old mode 100644 new mode 100755 index c1c74c97a7..b5ae13a680 --- a/qa/L0_savedmodel_shape/saved_model_shape_test.py +++ b/qa/L0_savedmodel_shape/saved_model_shape_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2018-2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,198 +27,202 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") -from builtins import range -from future.utils import iteritems import unittest -import numpy as np + import infer_util as iu +import numpy as np import test_util as tu -import os np_dtype_string = np.dtype(object) class SavedModelShapeTest(tu.TestResultCollector): - - def _full_exact(self, input_dtype, output0_dtype, output1_dtype, - output0_raw, output1_raw, swap): - - def _infer_exact_helper(tester, - pf, - tensor_shape, - batch_size, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=True, - output1_raw=True, - model_version=None, - swap=False, - outputs=("OUTPUT0", "OUTPUT1"), - use_http=True, - use_grpc=True, - skip_request_id_check=False, - use_streaming=True, - correlation_id=0): + def _full_exact( + self, input_dtype, output0_dtype, output1_dtype, output0_raw, output1_raw, swap + ): + def _infer_exact_helper( + tester, + pf, + tensor_shape, + batch_size, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=True, + output1_raw=True, + model_version=None, + swap=False, + outputs=("OUTPUT0", "OUTPUT1"), + use_http=True, + use_grpc=True, + skip_request_id_check=False, + use_streaming=True, + correlation_id=0, + ): for bs in (1, batch_size): # model that does not support batching if bs == 1: - iu.infer_exact(tester, - "savedmodel_nobatch", - tensor_shape, - bs, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - model_version=model_version, - swap=swap, - outputs=outputs, - use_http=use_http, - use_grpc=use_grpc, - skip_request_id_check=skip_request_id_check, - use_streaming=use_streaming, - correlation_id=correlation_id) + iu.infer_exact( + tester, + "savedmodel_nobatch", + tensor_shape, + bs, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + model_version=model_version, + swap=swap, + outputs=outputs, + use_http=use_http, + use_grpc=use_grpc, + skip_request_id_check=skip_request_id_check, + use_streaming=use_streaming, + correlation_id=correlation_id, + ) # model that supports batching - iu.infer_exact(tester, - "savedmodel", (bs,) + tensor_shape, - bs, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - model_version=model_version, - swap=swap, - outputs=outputs, - use_http=use_http, - use_grpc=use_grpc, - skip_request_id_check=skip_request_id_check, - use_streaming=use_streaming, - correlation_id=correlation_id) + iu.infer_exact( + tester, + "savedmodel", + (bs,) + tensor_shape, + bs, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + model_version=model_version, + swap=swap, + outputs=outputs, + use_http=use_http, + use_grpc=use_grpc, + skip_request_id_check=skip_request_id_check, + use_streaming=use_streaming, + correlation_id=correlation_id, + ) input_size = 16 - if tu.validate_for_tf_model(input_dtype, output0_dtype, output1_dtype, - (input_size,), (input_size,), - (input_size,)): - _infer_exact_helper(self, - "savedmodel", (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) + if tu.validate_for_tf_model( + input_dtype, + output0_dtype, + output1_dtype, + (input_size,), + (input_size,), + (input_size,), + ): + _infer_exact_helper( + self, + "savedmodel", + (input_size,), + 8, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + swap=swap, + ) def test_raw_bbb(self): - 
self._full_exact(np.int8, - np.int8, - np.int8, - output0_raw=True, - output1_raw=True, - swap=True) + self._full_exact( + np.int8, np.int8, np.int8, output0_raw=True, output1_raw=True, swap=True + ) def test_raw_sss(self): - self._full_exact(np.int16, - np.int16, - np.int16, - output0_raw=True, - output1_raw=True, - swap=True) + self._full_exact( + np.int16, np.int16, np.int16, output0_raw=True, output1_raw=True, swap=True + ) def test_raw_iii(self): - self._full_exact(np.int32, - np.int32, - np.int32, - output0_raw=True, - output1_raw=True, - swap=True) + self._full_exact( + np.int32, np.int32, np.int32, output0_raw=True, output1_raw=True, swap=True + ) def test_raw_lll(self): - self._full_exact(np.int64, - np.int64, - np.int64, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.int64, np.int64, np.int64, output0_raw=True, output1_raw=True, swap=False + ) def test_raw_hhh(self): - self._full_exact(np.float16, - np.float16, - np.float16, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.float16, + np.float16, + np.float16, + output0_raw=True, + output1_raw=True, + swap=False, + ) def test_raw_fff(self): - self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=True, - output1_raw=True, - swap=True) + self._full_exact( + np.float32, + np.float32, + np.float32, + output0_raw=True, + output1_raw=True, + swap=True, + ) def test_raw_hff(self): - self._full_exact(np.float16, - np.float32, - np.float32, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.float16, + np.float32, + np.float32, + output0_raw=True, + output1_raw=True, + swap=False, + ) def test_raw_bii(self): - self._full_exact(np.int8, - np.int32, - np.int32, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.int8, np.int32, np.int32, output0_raw=True, output1_raw=True, swap=False + ) def test_raw_ibb(self): - self._full_exact(np.int32, - np.int8, - np.int8, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.int32, np.int8, np.int8, output0_raw=True, output1_raw=True, swap=False + ) def test_raw_ibs(self): - self._full_exact(np.int32, - np.int8, - np.int16, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.int32, np.int8, np.int16, output0_raw=True, output1_raw=True, swap=False + ) def test_raw_iff(self): - self._full_exact(np.int32, - np.float32, - np.float32, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.int32, + np.float32, + np.float32, + output0_raw=True, + output1_raw=True, + swap=False, + ) def test_raw_fii(self): - self._full_exact(np.float32, - np.int32, - np.int32, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.float32, + np.int32, + np.int32, + output0_raw=True, + output1_raw=True, + swap=False, + ) def test_raw_ihs(self): - self._full_exact(np.int32, - np.float16, - np.int16, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.int32, + np.float16, + np.int16, + output0_raw=True, + output1_raw=True, + swap=False, + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_savedmodel_shape/test.sh b/qa/L0_savedmodel_shape/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_scalar_io/scalar_test.py b/qa/L0_scalar_io/scalar_test.py new file mode 100755 index 0000000000..16aa1136ca --- /dev/null +++ b/qa/L0_scalar_io/scalar_test.py @@ -0,0 +1,71 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & 
AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import os +import unittest + +import numpy as np +import test_util as tu +import tritonclient.grpc as grpcclient +from tritonclient.utils import np_to_triton_dtype + + +class ScalarIOTest(tu.TestResultCollector): + def setUp(self): + self._client = grpcclient.InferenceServerClient(url="localhost:8001") + self._backends = os.environ.get("BACKENDS", "onnx").split(",") + + def _send_request_and_verify_result(self, input, model_name): + inputs = [] + inputs.append( + grpcclient.InferInput("INPUT", input.shape, np_to_triton_dtype(input.dtype)) + ) + inputs[-1].set_data_from_numpy(input) + result = self._client.infer(inputs=inputs, model_name=model_name) + output = result.as_numpy("OUTPUT") + np.testing.assert_allclose(input, output) + + def test_scalar_io(self): + for backend in self._backends: + model_name = f"{backend}_scalar_1dim" + self._send_request_and_verify_result( + np.asarray([1], dtype=np.float32), model_name + ) + + model_name = f"{backend}_scalar_2dim" + self._send_request_and_verify_result( + np.asarray([[1]], dtype=np.float32), model_name + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_scalar_io/test.sh b/qa/L0_scalar_io/test.sh new file mode 100755 index 0000000000..ebb9a48d95 --- /dev/null +++ b/qa/L0_scalar_io/test.sh @@ -0,0 +1,93 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +RET=0 +TEST_RESULT_FILE='test_results.txt' +BACKENDS="onnx" +export CUDA_VISIBLE_DEVICES=0 +DATADIR=/data/inferenceserver/${REPO_VERSION} + +rm -rf models +mkdir models +cp -r $DATADIR/qa_scalar_models/* models/ + +CLIENT_LOG="./client.log" +SCALAR_TEST=scalar_test.py +source ../common/util.sh + +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_ARGS="--model-repository=`pwd`/models" +SERVER_LOG="./inference_server.log" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +python3 $SCALAR_TEST >> $CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** scalar_test.py FAILED. \n***" + cat $CLIENT_LOG + cat $SERVER_LOG + RET=1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +# Make sure the server fails loading the model if it has a dimension higher than +# 1 +sed -i "s/dims.*/dims:\[2\]/g" models/onnx_scalar_1dim/config.pbtxt +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "\n***\n*** Expected the server to fail loading \n***" + cat $SERVER_LOG + exit 1 +fi + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_sdk/grpc_test.cc b/qa/L0_sdk/grpc_test.cc index 09fe5bbc51..3f45e4ae25 100644 --- a/qa/L0_sdk/grpc_test.cc +++ b/qa/L0_sdk/grpc_test.cc @@ -25,6 +25,7 @@ // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. #include + #include "grpc_client.h" namespace tc = triton::client; diff --git a/qa/L0_sdk/http_test.cc b/qa/L0_sdk/http_test.cc index 2c8e231fb2..0b2a4da597 100644 --- a/qa/L0_sdk/http_test.cc +++ b/qa/L0_sdk/http_test.cc @@ -25,6 +25,7 @@ // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. #include + #include "http_client.h" namespace tc = triton::client; diff --git a/qa/L0_sdk/test.sh b/qa/L0_sdk/test.sh index 8a52fc05ef..20baf31639 100755 --- a/qa/L0_sdk/test.sh +++ b/qa/L0_sdk/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2019-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -152,7 +152,9 @@ else RET=1 fi -# Check wheels +# Check wheels, note that even TRITON_VERSION is passed as version field for +# wheel generation. The version number will be normalized by setuptools, so +# we need to replace the text here as well to match the normalized version. WHLVERSION=`cat /workspace/TRITON_VERSION | sed 's/dev/\.dev0/'` if [[ "aarch64" != $(uname -m) ]] ; then WHLS="tritonclient-${WHLVERSION}-py3-none-any.whl \ diff --git a/qa/L0_secure_grpc/test.sh b/qa/L0_secure_grpc/test.sh old mode 100644 new mode 100755 index b090258027..784613c6a2 --- a/qa/L0_secure_grpc/test.sh +++ b/qa/L0_secure_grpc/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -42,6 +42,7 @@ export CUDA_VISIBLE_DEVICES=0 RET=0 +TEST_CLIENT_AIO_PY=../clients/simple_grpc_aio_infer_client.py TEST_CLIENT_PY=../clients/simple_grpc_infer_client.py TEST_CLIENT=../clients/simple_grpc_infer_client @@ -102,6 +103,11 @@ for CASE in server mutual both; do cat ${CLIENT_LOG}.${CASE}.ssl_infer RET=1 fi + $TEST_CLIENT_AIO_PY -v --ssl --root-certificates ca.crt --private-key client.key --certificate-chain client.crt >> ${CLIENT_LOG}.${CASE}.ssl_infer.aio 2>&1 + if [ $? -ne 0 ]; then + cat ${CLIENT_LOG}.${CASE}.ssl_infer.aio + RET=1 + fi $TEST_CLIENT -v --ssl --root-certificates ca.crt --private-key client.key --certificate-chain client.crt >> ${CLIENT_LOG}.${CASE}.c++.ssl_infer 2>&1 if [ $? -ne 0 ]; then @@ -140,6 +146,13 @@ for CASE in server mutual; do else RET=1 fi + $TEST_CLIENT_AIO_PY -v >> ${CLIENT_LOG}.${CASE}.no_ssl_fail_infer.aio 2>&1 + if [ $? -ne 0 ]; then + cat ${CLIENT_LOG}.${CASE}.no_ssl_fail_infer.aio + echo -e "\n***\n*** Expected test failure\n***" + else + RET=1 + fi $TEST_CLIENT -v >> ${CLIENT_LOG}.${CASE}.c++.no_ssl_fail_infer 2>&1 if [ $? -ne 0 ]; then @@ -157,6 +170,13 @@ for CASE in server mutual; do else RET=1 fi + $TEST_CLIENT_AIO_PY -v --ssl --root-certificates ca.crt --private-key client2.key --certificate-chain client2.crt >> ${CLIENT_LOG}.${CASE}.wrong_ssl_fail_infer.aio 2>&1 + if [ $? -ne 0 ]; then + cat ${CLIENT_LOG}.${CASE}.wrong_ssl_fail_infer.aio + echo -e "\n***\n*** Expected test failure\n***" + else + RET=1 + fi $TEST_CLIENT -v --ssl --root-certificates ca.crt --private-key client2.key --certificate-chain client2.crt >> ${CLIENT_LOG}.${CASE}.c++.wrong_ssl_fail_infer 2>&1 if [ $? -ne 0 ]; then diff --git a/qa/L0_sequence_batcher/request_timeout_models/custom_sequence_int32_timeout/config.pbtxt b/qa/L0_sequence_batcher/request_timeout_models/custom_sequence_int32_timeout/config.pbtxt new file mode 100644 index 0000000000..d9be228d5d --- /dev/null +++ b/qa/L0_sequence_batcher/request_timeout_models/custom_sequence_int32_timeout/config.pbtxt @@ -0,0 +1,62 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
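# Context for the wheel-version comment in the L0_sdk change above: setuptools
# normalizes PEP 440 development versions, so a raw TRITON_VERSION such as the
# hypothetical "2.35.0dev" ends up in the generated wheel file name as
# "2.35.0.dev0". The sed expression in test.sh reproduces that normalization;
# a rough Python equivalent (version string illustrative only):
def normalize_dev_version(raw_version: str) -> str:
    # Mirrors `sed 's/dev/\.dev0/'`; assumes the version ends in a bare "dev".
    return raw_version.replace("dev", ".dev0")

print(normalize_dev_version("2.35.0dev"))  # -> 2.35.0.dev0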
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "identity" +max_batch_size: 1 + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] + +sequence_batching { + max_sequence_idle_microseconds: 50000000 +} + +parameters [ + { + key: "execute_delay_ms" + value: { string_value: "5000" } + } +] diff --git a/qa/L0_sequence_batcher/sequence_batcher_test.py b/qa/L0_sequence_batcher/sequence_batcher_test.py old mode 100644 new mode 100755 index 0b794ece9c..3e6cfc032a --- a/qa/L0_sequence_batcher/sequence_batcher_test.py +++ b/qa/L0_sequence_batcher/sequence_batcher_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python + +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,22 +30,25 @@ sys.path.append("../common") -from builtins import str import os -import time +import random import threading +import time import unittest +from builtins import str +from functools import partial + import numpy as np -import test_util as tu import sequence_util as su +import test_util as tu +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException -TEST_SYSTEM_SHARED_MEMORY = bool( - int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0))) -TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get('TEST_CUDA_SHARED_MEMORY', - 0))) +TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0))) +TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0))) -USE_GRPC = (os.environ.get('USE_GRPC', 1) != "0") -USE_HTTP = (os.environ.get('USE_HTTP', 1) != "0") +USE_GRPC = os.environ.get("USE_GRPC", 1) != "0" +USE_HTTP = os.environ.get("USE_HTTP", 1) != "0" assert USE_GRPC or USE_HTTP, "USE_GRPC or USE_HTTP must be non-zero" if USE_GRPC and USE_HTTP: _protocols = ("http", "grpc") @@ -52,27 +57,27 @@ else: _protocols = ("http",) -BACKENDS = os.environ.get('BACKENDS', "graphdef savedmodel onnx plan custom") -ENSEMBLES = bool(int(os.environ.get('ENSEMBLES', 1))) +BACKENDS = os.environ.get("BACKENDS", "graphdef savedmodel onnx plan custom python") +ENSEMBLES = bool(int(os.environ.get("ENSEMBLES", 1))) -NO_BATCHING = (int(os.environ['NO_BATCHING']) == 1) -MODEL_INSTANCES = int(os.environ['MODEL_INSTANCES']) -IMPLICIT_STATE = (int(os.environ['IMPLICIT_STATE']) == 1) +NO_BATCHING = int(os.environ["NO_BATCHING"]) == 1 +MODEL_INSTANCES = int(os.environ["MODEL_INSTANCES"]) +IMPLICIT_STATE = int(os.environ["IMPLICIT_STATE"]) == 1 # Use initial state for implicit state -INITIAL_STATE_FILE = (int(os.environ['INITIAL_STATE_FILE']) == 1) +INITIAL_STATE_FILE = int(os.environ["INITIAL_STATE_FILE"]) == 1 _trials = () if NO_BATCHING: - for backend in BACKENDS.split(' '): - if (backend != "libtorch") and (backend != 'custom'): + for backend in BACKENDS.split(" "): + if backend != "custom": _trials += (backend + "_nobatch",) -elif os.environ['BATCHER_TYPE'] == "VARIABLE": - for backend in BACKENDS.split(' '): - if (backend != "libtorch") and (backend != 'custom'): +elif os.environ["BATCHER_TYPE"] == "VARIABLE": + for backend in BACKENDS.split(" "): + if (backend != "libtorch") and (backend != "custom"): _trials += (backend,) else: - _trials = BACKENDS.split(' ') + _trials = BACKENDS.split(" ") # Add ensemble to the _trials ENSEMBLE_PREFIXES = ["simple_", "sequence_", "fan_"] @@ -94,7 +99,7 @@ # Not all models can be tested for ragged handling because the models # don't deal well with non-size-1 shapes _ragged_batch_not_supported_trials = list() -if os.environ['BATCHER_TYPE'] == "VARIABLE": +if os.environ["BATCHER_TYPE"] == "VARIABLE": if "custom" in _trials: _ragged_batch_not_supported_trials.append("custom") if "plan" in _trials: @@ -115,45 +120,47 @@ def is_ensemble(model_name): class SequenceBatcherTest(su.SequenceBatcherTestUtil): - def get_datatype(self, trial): # Get the datatype to use based on what models are available (see test.sh) - if ("plan" in trial): + if "plan" in trial: return (np.float32,) - if ("custom" in trial): + if "custom" in trial: return (np.int32,) - if ("savedmodel" in trial): + if "savedmodel" in trial: return (np.float32, np.bool_) - if ("graphdef" in trial): + 
if "graphdef" in trial: return (np.dtype(object), np.bool_) - # Only test the string data type for ONNX models in implicit state + # Only test the string data type for ONNX and libtorch models in implicit state if IMPLICIT_STATE: - if ("onnx" in trial): + if "onnx" in trial: return (np.dtype(object), np.int32, np.bool_) + if NO_BATCHING: + if "libtorch" in trial: + return (np.dtype(object), np.int32, np.bool_) return (np.int32, np.bool_) def get_expected_result(self, expected_result, value, trial, flag_str=None): # Adjust the expected_result for models that - # couldn't implement the full accumulator. See + # could not implement the full accumulator. See # qa/common/gen_qa_sequence_models.py for more # information. - if ((not NO_BATCHING and - ("custom" not in trial)) or ("graphdef" in trial) or - ("plan" in trial) or ("onnx" in trial)) or ("libtorch" in trial): + if ( + (not NO_BATCHING and ("custom" not in trial)) + or ("graphdef" in trial) + or ("plan" in trial) + or ("onnx" in trial) + ) or ("libtorch" in trial): expected_result = value if (flag_str is not None) and ("start" in flag_str): expected_result += 1 return expected_result - def get_expected_result_implicit(self, - expected_result, - value, - trial, - flag_str=None, - dtype=None): - if dtype == np.dtype(object): + def get_expected_result_implicit( + self, expected_result, value, trial, flag_str=None, dtype=None + ): + if dtype == np.dtype(object) and trial.startswith("onnx"): return value if INITIAL_STATE_FILE: @@ -176,7 +183,8 @@ def test_simple_sequence(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -185,14 +193,17 @@ def test_simple_sequence(self): self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - expected_result = self.get_expected_result( - 45, 9, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 45, 9, trial, "end", dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + expected_result = ( + self.get_expected_result(45, 9, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 45, 9, trial, "end", dtype + ) + ) self.check_sequence( trial, @@ -201,19 +212,28 @@ def test_simple_sequence(self): 5, (4000, None), # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay)) - (("start", 1, None, None), (None, 2, None, None), - (None, 3, None, None), (None, 4, None, None), - (None, 5, None, None), (None, 6, None, None), - (None, 7, None, None), (None, 8, None, None), - ("end", 9, None, None)), + ( + ("start", 1, None, None), + (None, 2, None, None), + (None, 3, None, None), + (None, 4, None, None), + (None, 5, None, None), + (None, 6, None, None), + (None, 7, None, None), + (None, 8, None, None), + ("end", 9, None, None), + ), expected_result, protocol, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) self.check_deferred_exception() - self.check_status(model_name, {1: 9 * (idx + 1)}, - 9 * (idx + 1), 9 * (idx + 1)) + self.check_status( + model_name, {1: 9 * (idx + 1)}, 9 * (idx + 1), 9 * (idx + 1) + ) except Exception as ex: 
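# A worked example of the expected-result adjustment defined above
# (illustrative only): test_simple_sequence sends the values 1..9, so a model
# implementing the full accumulator should return sum(range(1, 10)) == 45, the
# first argument of get_expected_result(). For the trials matched by the
# condition above, which only report the most recent value, the expectation
# collapses to the second argument (9), and one more is added when the flag
# string contains "start", presumably reflecting the START control input being
# folded into that request's value.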
self.assertTrue(False, "unexpected error {}".format(ex)) @@ -229,7 +249,8 @@ def test_length1_sequence(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -238,14 +259,17 @@ def test_length1_sequence(self): self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - expected_result = self.get_expected_result( - 42, 42, trial, "start,end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 42, 42, trial, "start,end", dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + expected_result = ( + self.get_expected_result(42, 42, trial, "start,end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 42, 42, trial, "start,end", dtype + ) + ) self.check_sequence( trial, @@ -254,16 +278,18 @@ def test_length1_sequence(self): 99, (4000, None), # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay)) - ( - ("start,end", 42, None, None),), + (("start,end", 42, None, None),), expected_result, protocol, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) self.check_deferred_exception() - self.check_status(model_name, {1: idx + 1}, (idx + 1), - (idx + 1)) + self.check_status( + model_name, {1: idx + 1}, (idx + 1), (idx + 1) + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -285,7 +311,8 @@ def test_batch_size(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -294,14 +321,17 @@ def test_batch_size(self): self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - expected_result = self.get_expected_result( - 10, 9, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 10, 9, trial, "end", dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + expected_result = ( + self.get_expected_result(10, 9, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 10, 9, trial, "end", dtype + ) + ) self.check_sequence( trial, @@ -315,27 +345,36 @@ def test_batch_size(self): protocol, batch_size=2, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) self.check_deferred_exception() self.assertTrue(False, "expected error") except Exception as ex: for prefix in ENSEMBLE_PREFIXES: if model_name.startswith(prefix): - base_model_name = model_name[(len(prefix)):] - self.assertTrue(ex.message().startswith( - str("in ensemble '{}', " + - "inference request to model '{}' must specify " - + - "batch-size 1 due to requirements of sequence " - + "batcher").format( - model_name, base_model_name))) + base_model_name = model_name[(len(prefix)) :] + 
self.assertTrue( + ex.message().startswith( + str( + "in ensemble '{}', " + + "inference request to model '{}' must specify " + + "batch-size 1 due to requirements of sequence " + + "batcher" + ).format(model_name, base_model_name) + ) + ) return - self.assertTrue(ex.message().startswith( - str("inference request to model '{}' must specify " - + - "batch-size 1 due to requirements of sequence " - + "batcher").format(model_name))) + self.assertTrue( + ex.message().startswith( + str( + "inference request to model '{}' must specify " + + "batch-size 1 due to requirements of sequence " + + "batcher" + ).format(model_name) + ) + ) def test_no_correlation_id(self): # Send sequence without correlation ID and check for error. @@ -347,7 +386,8 @@ def test_no_correlation_id(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -356,14 +396,17 @@ def test_no_correlation_id(self): self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - expected_result = self.get_expected_result( - 10, 9, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 10, 9, trial, "end", dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + expected_result = ( + self.get_expected_result(10, 9, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 10, 9, trial, "end", dtype + ) + ) self.check_sequence( trial, @@ -376,25 +419,34 @@ def test_no_correlation_id(self): expected_result, protocol, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) self.check_deferred_exception() self.assertTrue(False, "expected error") except Exception as ex: for prefix in ENSEMBLE_PREFIXES: if model_name.startswith(prefix): - base_model_name = model_name[(len(prefix)):] - self.assertTrue(ex.message().startswith( - str("in ensemble '{}', " + - "inference request to model '{}' must specify a " - + "non-zero or non-empty correlation ID" - ).format(model_name, base_model_name))) + base_model_name = model_name[(len(prefix)) :] + self.assertTrue( + ex.message().startswith( + str( + "in ensemble '{}', " + + "inference request to model '{}' must specify a " + + "non-zero or non-empty correlation ID" + ).format(model_name, base_model_name) + ) + ) return - self.assertTrue(ex.message().startswith( - str("inference request to model '{}' must specify a " - + - "non-zero or non-empty correlation ID").format( - model_name))) + self.assertTrue( + ex.message().startswith( + str( + "inference request to model '{}' must specify a " + + "non-zero or non-empty correlation ID" + ).format(model_name) + ) + ) def test_no_sequence_start(self): # Send sequence without start flag for never before seen @@ -407,7 +459,8 @@ def test_no_sequence_start(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -416,15 +469,18 @@ def test_no_sequence_start(self): 
self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - - expected_result = self.get_expected_result( - 6, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + + expected_result = ( + self.get_expected_result(6, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) self.check_sequence( trial, model_name, @@ -432,12 +488,17 @@ def test_no_sequence_start(self): 37469245, (4000, None), # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay)) - ((None, 1, None, None), (None, 2, None, None), - ("end", 3, None, None)), + ( + (None, 1, None, None), + (None, 2, None, None), + ("end", 3, None, None), + ), expected_result, protocol, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) self.check_deferred_exception() self.assertTrue(False, "expected error") @@ -445,20 +506,27 @@ def test_no_sequence_start(self): print(model_name + "-> " + ex.message()) for prefix in ENSEMBLE_PREFIXES: if model_name.startswith(prefix): - base_model_name = model_name[(len(prefix)):] - self.assertTrue(ex.message().startswith( - str("in ensemble '{}', " + - "inference request for sequence 37469245 to " - + - "model '{}' must specify the START flag on the first " - + "request of the sequence").format( - model_name, base_model_name))) + base_model_name = model_name[(len(prefix)) :] + self.assertTrue( + ex.message().startswith( + str( + "in ensemble '{}', " + + "inference request for sequence 37469245 to " + + "model '{}' must specify the START flag on the first " + + "request of the sequence" + ).format(model_name, base_model_name) + ) + ) return - self.assertTrue(ex.message().startswith( - str("inference request for sequence 37469245 to " + - "model '{}' must specify the START flag on the first " - + - "request of the sequence").format(model_name))) + self.assertTrue( + ex.message().startswith( + str( + "inference request for sequence 37469245 to " + + "model '{}' must specify the START flag on the first " + + "request of the sequence" + ).format(model_name) + ) + ) def test_no_sequence_start2(self): # Send sequence without start flag after sending a valid @@ -472,7 +540,8 @@ def test_no_sequence_start2(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -481,14 +550,17 @@ def test_no_sequence_start2(self): self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - expected_result = self.get_expected_result( - 6, 3, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, None, dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + expected_result = ( + self.get_expected_result(6, 3, trial, None) + if not IMPLICIT_STATE + else 
self.get_expected_result_implicit( + 6, 3, trial, None, dtype + ) + ) self.check_sequence( trial, @@ -497,34 +569,48 @@ def test_no_sequence_start2(self): 3, (4000, None), # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay)) - (("start", 1, None, None), (None, 2, None, None), - ("end", 3, None, None), (None, 55, None, None)), + ( + ("start", 1, None, None), + (None, 2, None, None), + ("end", 3, None, None), + (None, 55, None, None), + ), expected_result, protocol, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) - self.check_status(model_name, {1: 3 * (idx + 1)}, - 3 * (idx + 1), 3 * (idx + 1)) + self.check_status( + model_name, {1: 3 * (idx + 1)}, 3 * (idx + 1), 3 * (idx + 1) + ) self.check_deferred_exception() self.assertTrue(False, "expected error") except Exception as ex: for prefix in ENSEMBLE_PREFIXES: if model_name.startswith(prefix): - base_model_name = model_name[(len(prefix)):] - self.assertTrue(ex.message().startswith( - str("in ensemble '{}', " + - "inference request for sequence 3 to model '{}' must " - + - "specify the START flag on the first request of " - + "the sequence").format( - model_name, base_model_name))) + base_model_name = model_name[(len(prefix)) :] + self.assertTrue( + ex.message().startswith( + str( + "in ensemble '{}', " + + "inference request for sequence 3 to model '{}' must " + + "specify the START flag on the first request of " + + "the sequence" + ).format(model_name, base_model_name) + ) + ) return - self.assertTrue(ex.message().startswith( - str("inference request for sequence 3 to model '{}' must " - + - "specify the START flag on the first request of " - + "the sequence").format(model_name))) + self.assertTrue( + ex.message().startswith( + str( + "inference request for sequence 3 to model '{}' must " + + "specify the START flag on the first request of " + + "the sequence" + ).format(model_name) + ) + ) def test_no_sequence_end(self): # Send sequence without end flag. 
Use same correlation ID to @@ -538,7 +624,8 @@ def test_no_sequence_end(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -547,14 +634,17 @@ def test_no_sequence_end(self): self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - expected_result = self.get_expected_result( - 51, 9, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 51, 9, trial, "end", dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + expected_result = ( + self.get_expected_result(51, 9, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 51, 9, trial, "end", dtype + ) + ) self.check_sequence( trial, @@ -563,16 +653,23 @@ def test_no_sequence_end(self): 4566, (4000, None), # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay)) - (("start", 1, None, None), (None, 2, None, None), - ("start", 42, None, None), ("end", 9, None, None)), + ( + ("start", 1, None, None), + (None, 2, None, None), + ("start", 42, None, None), + ("end", 9, None, None), + ), expected_result, protocol, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) self.check_deferred_exception() - self.check_status(model_name, {1: 4 * (idx + 1)}, - 4 * (idx + 1), 4 * (idx + 1)) + self.check_status( + model_name, {1: 4 * (idx + 1)}, 4 * (idx + 1), 4 * (idx + 1) + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -586,8 +683,9 @@ def test_half_batch(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -596,29 +694,31 @@ def test_half_batch(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3, 4), dtype, 0) + (1, 2, 3, 4), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (0, 9, 5, 13), dtype, 1) + (0, 9, 5, 13), dtype, 1 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 8) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 8) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) - expected_result = self.get_expected_result( - 10, 4, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 10, 4, trial, "end", dtype) + expected_result = ( + self.get_expected_result(10, 4, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 10, 4, trial, "end", dtype + ) + ) threads = [] threads.append( @@ -631,18 +731,25 @@ def test_half_batch(self): 987, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), - (None, 3, None), ("end", 4, None)), + ( + ("start", 1, None), + (None, 2, None), + (None, 3, None), + ("end", 4, None), + ), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 27, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 27, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(27, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 27, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -653,14 +760,18 @@ def test_half_batch(self): 988, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 0, None), (None, 9, None), - (None, 5, None), ("end", 13, None)), + ( + ("start", 0, None), + (None, 9, None), + (None, 5, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) for t in threads: t.start() @@ -676,7 +787,9 @@ def test_half_batch(self): self.check_status( model_name, {stats_batch_size: 4 * min(2, MODEL_INSTANCES)}, - exec_cnt, 8) + exec_cnt, + 8, + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) finally: @@ -694,8 +807,9 @@ def test_skip_batch(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -704,34 +818,40 @@ def test_skip_batch(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 3), dtype, 0) + (1, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13, 14), dtype, 1) + (11, 12, 13, 14), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 113), dtype, 2) + (111, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113, 1114), dtype, 3) + (1111, 1112, 1113, 1114), 
dtype, 3 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. - self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 4, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 4, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(4, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 4, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -744,15 +864,18 @@ def test_skip_batch(self): # (flag_str, value, pre_delay_ms) (("start", 1, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 50, 14, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 50, 14, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(50, 14, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 50, 14, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -763,18 +886,25 @@ def test_skip_batch(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - (None, 13, None), ("end", 14, None)), + ( + ("start", 11, None), + (None, 12, None), + (None, 13, None), + ("end", 14, None), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 224, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 224, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(224, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 224, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -787,15 +917,18 @@ def test_skip_batch(self): # (flag_str, value, pre_delay_ms) (("start", 111, None), ("end", 113, None)), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 4450, 1114, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 4450, 1114, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(4450, 1114, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 4450, 1114, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -806,14 +939,18 @@ 
def test_skip_batch(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - (None, 1113, None), ("end", 1114, None)), + ( + ("start", 1111, None), + (None, 1112, None), + (None, 1113, None), + ("end", 1114, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[1].start() threads[3].start() @@ -858,8 +995,9 @@ def test_full_batch(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -868,33 +1006,39 @@ def test_full_batch(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0) + (1, 2, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13), dtype, 1) + (11, 12, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 113), dtype, 2) + (111, 112, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3) + (1111, 1112, 1113), dtype, 3 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. - self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) - - expected_result = self.get_expected_result( - 6, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) + + expected_result = ( + self.get_expected_result(6, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) threads = [] threads.append( threading.Thread( @@ -906,19 +1050,21 @@ def test_full_batch(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - - expected_result = self.get_expected_result( - 36, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 36, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + + expected_result = ( + self.get_expected_result(36, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 36, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -929,19 +1075,25 @@ def test_full_batch(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - ("end", 
13, None)), + ( + ("start", 11, None), + (None, 12, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - - expected_result = self.get_expected_result( - 336, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 336, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + + expected_result = ( + self.get_expected_result(336, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 336, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -952,18 +1104,24 @@ def test_full_batch(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, None), (None, 112, None), - ("end", 113, None)), + ( + ("start", 111, None), + (None, 112, None), + ("end", 113, None), + ), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 3336, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336, 1113, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(3336, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336, 1113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -974,14 +1132,17 @@ def test_full_batch(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) for t in threads: t.start() @@ -992,9 +1153,12 @@ def test_full_batch(self): # Requests do not get batched for the ensemble model self.check_status(model_name, {1: 12}, 12, 12) else: - self.check_status(model_name, { - (4 / MODEL_INSTANCES): (3 * MODEL_INSTANCES) - }, 3 * MODEL_INSTANCES, 12) + self.check_status( + model_name, + {(4 / MODEL_INSTANCES): (3 * MODEL_INSTANCES)}, + 3 * MODEL_INSTANCES, + 12, + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) finally: @@ -1021,8 +1185,9 @@ def test_ragged_batch(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -1031,34 +1196,40 @@ def test_ragged_batch(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0, tensor_shape=(2,)) + (1, 2, 3), dtype, 0, tensor_shape=(2,) + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13), dtype, 1, tensor_shape=(2,)) + (11, 12, 13), dtype, 1, tensor_shape=(2,) + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 113), dtype, 2, tensor_shape=(1,)) + (111, 112, 113), dtype, 2, tensor_shape=(1,) + ) 
precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3, tensor_shape=(3,)) + (1111, 1112, 1113), dtype, 3, tensor_shape=(3,) + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. - self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 6 * 2, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(6 * 2, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1069,20 +1240,24 @@ def test_ragged_batch(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), + precreated_shm0_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (2,) - })) - - expected_result = self.get_expected_result( - 36 * 2, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 36, 13, trial, "end", dtype) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (2,), + }, + ) + ) + + expected_result = ( + self.get_expected_result(36 * 2, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 36, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1093,19 +1268,27 @@ def test_ragged_batch(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - ("end", 13, None)), + ( + ("start", 11, None), + (None, 12, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), + precreated_shm1_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (2,) - })) - expected_result = self.get_expected_result( - 336, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 336, 113, trial, "end", dtype) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (2,), + }, + ) + ) + expected_result = ( + self.get_expected_result(336, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 336, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1116,19 +1299,27 @@ def test_ragged_batch(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, None), (None, 112, None), - ("end", 113, None)), + ( + ("start", 111, None), + (None, 112, None), + ("end", 113, None), + ), expected_result, - precreated_shm2_handles), + precreated_shm2_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (1,) - })) - expected_result = self.get_expected_result( - 
3336 * 3, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336, 1113, trial, "end", dtype) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (1,), + }, + ) + ) + expected_result = ( + self.get_expected_result(3336 * 3, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336, 1113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1139,15 +1330,20 @@ def test_ragged_batch(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), + precreated_shm3_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (3,) - })) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (3,), + }, + ) + ) threads[0].start() threads[1].start() @@ -1188,8 +1384,9 @@ def test_ragged_batch_allowed(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -1198,34 +1395,40 @@ def test_ragged_batch_allowed(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0, tensor_shape=(2,)) + (1, 2, 3), dtype, 0, tensor_shape=(2,) + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13), dtype, 1, tensor_shape=(2,)) + (11, 12, 13), dtype, 1, tensor_shape=(2,) + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 113), dtype, 2, tensor_shape=(1,)) + (111, 112, 113), dtype, 2, tensor_shape=(1,) + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3, tensor_shape=(3,)) + (1111, 1112, 1113), dtype, 3, tensor_shape=(3,) + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
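A large share of the remaining hunks are the same mechanical rewrap of the expected_result conditional expression; the call contents are unchanged, only the wrapping moves. A before/after sketch with simplified stand-in helpers (the trial and dtype strings below are placeholders, not values taken from the test matrix):

IMPLICIT_STATE = False


def get_expected_result(total, last_value, trial, flags):
    # Stand-in for the real helper; just returns its inputs for demonstration.
    return (total, last_value, trial, flags)


def get_expected_result_implicit(total, last_value, trial, flags, dtype):
    return (total, last_value, trial, flags, dtype)


# Pre-patch wrapping: the ternary is split across the two call continuations.
expected_result = get_expected_result(
    6, 3, "some_trial", "end"
) if not IMPLICIT_STATE else get_expected_result_implicit(
    6, 3, "some_trial", "end", "int32")

# Post-patch wrapping: the whole conditional expression is parenthesized so
# the condition and each branch sit on their own lines.
expected_result = (
    get_expected_result(6, 3, "some_trial", "end")
    if not IMPLICIT_STATE
    else get_expected_result_implicit(6, 3, "some_trial", "end", "int32")
)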
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 6 * 2, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6 * 2, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(6 * 2, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6 * 2, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1236,20 +1439,24 @@ def test_ragged_batch_allowed(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), + precreated_shm0_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (2,) - })) - - expected_result = self.get_expected_result( - 36 * 2, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 36 * 2, 13, trial, "end", dtype) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (2,), + }, + ) + ) + + expected_result = ( + self.get_expected_result(36 * 2, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 36 * 2, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1260,19 +1467,27 @@ def test_ragged_batch_allowed(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - ("end", 13, None)), + ( + ("start", 11, None), + (None, 12, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), + precreated_shm1_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (2,) - })) - expected_result = self.get_expected_result( - 336, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 336, 113, trial, "end", dtype) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (2,), + }, + ) + ) + expected_result = ( + self.get_expected_result(336, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 336, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1283,19 +1498,27 @@ def test_ragged_batch_allowed(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, None), (None, 112, None), - ("end", 113, None)), + ( + ("start", 111, None), + (None, 112, None), + ("end", 113, None), + ), expected_result, - precreated_shm2_handles), + precreated_shm2_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (1,) - })) - expected_result = self.get_expected_result( - 3336 * 3, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336 * 3, 1113, trial, "end", dtype) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (1,), + }, + ) + ) + expected_result 
= ( + self.get_expected_result(3336 * 3, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336 * 3, 1113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1306,15 +1529,20 @@ def test_ragged_batch_allowed(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), + precreated_shm3_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (3,) - })) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (3,), + }, + ) + ) for t in threads: t.start() @@ -1346,8 +1574,9 @@ def test_backlog(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -1356,36 +1585,43 @@ def test_backlog(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0) + (1, 2, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13), dtype, 1) + (11, 12, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 113), dtype, 2) + (111, 112, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3) + (1111, 1112, 1113), dtype, 3 + ) precreated_shm4_handles = self.precreate_register_regions( - (11111, 11112, 11113), dtype, 4) + (11111, 11112, 11113), dtype, 4 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 6, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(6, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1396,18 +1632,20 @@ def test_backlog(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 36, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 36, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(36, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 36, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1418,18 +1656,24 @@ def test_backlog(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - ("end", 13, None)), + ( + ("start", 11, None), + (None, 12, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 336, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 336, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(336, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 336, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1440,18 +1684,24 @@ def test_backlog(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, None), (None, 112, None), - ("end", 113, None)), + ( + ("start", 111, None), + (None, 112, None), + ("end", 113, None), + ), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 3336, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336, 1113, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(3336, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336, 1113, trial, "end", dtype + ) + ) threads.append( threading.Thread( 
target=self.check_sequence_async, @@ -1462,19 +1712,25 @@ def test_backlog(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - - expected_result = self.get_expected_result( - 33336, 11113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 33336, 11113, trial, "end", dtype) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + + expected_result = ( + self.get_expected_result(33336, 11113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 33336, 11113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1485,14 +1741,17 @@ def test_backlog(self): 1005, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11111, None), (None, 11112, None), - ("end", 11113, None)), + ( + ("start", 11111, None), + (None, 11112, None), + ("end", 11113, None), + ), expected_result, - precreated_shm4_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm4_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) for t in threads: t.start() @@ -1537,8 +1796,9 @@ def test_backlog_fill(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -1547,38 +1807,46 @@ def test_backlog_fill(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0) + (1, 2, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 13), dtype, 1) + (11, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 113), dtype, 2) + (111, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3) + (1111, 1112, 1113), dtype, 3 + ) precreated_shm4_handles = self.precreate_register_regions( - (11111,), dtype, 4) + (11111,), dtype, 4 + ) precreated_shm5_handles = self.precreate_register_regions( - (22222,), dtype, 5) + (22222,), dtype, 5 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
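Every multi-sequence test touched by these hunks follows the same concurrency skeleton: build a list of threads targeting check_sequence_async, one per sequence, then start and join them and surface any deferred exception afterwards. A condensed sketch with a no-op stand-in for the helper (argument names and values are illustrative only, copied loosely from the hunks above):

import threading


def check_sequence_async(trial, model_name, dtype, correlation_id, timeouts,
                         steps, expected_result, shm_handles, sequence_name=None):
    # No-op stand-in for the real helper, which sends one inference per
    # (flag_str, value, pre_delay_ms) step and records failures for the
    # later check_deferred_exception() call.
    pass


threads = []
for correlation_id, steps, expected in (
    (1001, (("start", 1, None), (None, 2, None), ("end", 3, None)), 6),
    (1002, (("start", 11, None), (None, 12, None), ("end", 13, None)), 36),
):
    threads.append(
        threading.Thread(
            target=check_sequence_async,
            args=("some_trial", "some_model", "int32", correlation_id,
                  (None, None), steps, expected, None),
            kwargs={"sequence_name": "example_sequence"},
        )
    )

for t in threads:
    t.start()
for t in threads:
    t.join()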
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 10) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 10 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 2) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 2 + ) threads = [] - expected_result = self.get_expected_result( - 6, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(6, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1589,18 +1857,20 @@ def test_backlog_fill(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 24, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 24, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(24, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 24, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1613,15 +1883,18 @@ def test_backlog_fill(self): # (flag_str, value, pre_delay_ms) (("start", 11, None), ("end", 13, None)), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 224, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 224, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(224, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 224, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1634,15 +1907,18 @@ def test_backlog_fill(self): # (flag_str, value, pre_delay_ms) (("start", 111, None), ("end", 113, None)), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 3336, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336, 1113, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(3336, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336, 1113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1653,18 +1929,24 @@ def test_backlog_fill(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 1113, None)), + ( + ("start", 
1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 11111, 11111, trial, "start,end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 11111, 11111, trial, "start,end", dtype) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(11111, 11111, trial, "start,end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 11111, 11111, trial, "start,end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1675,18 +1957,20 @@ def test_backlog_fill(self): 1005, (None, None), # (flag_str, value, pre_delay_ms) - ( - ("start,end", 11111, None),), + (("start,end", 11111, None),), expected_result, - precreated_shm4_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 22222, 22222, trial, "start,end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 22222, 22222, trial, "start,end", dtype) + precreated_shm4_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(22222, 22222, trial, "start,end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 22222, 22222, trial, "start,end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1697,14 +1981,13 @@ def test_backlog_fill(self): 1006, (None, None), # (flag_str, value, pre_delay_ms) - ( - ("start,end", 22222, None),), + (("start,end", 22222, None),), expected_result, - precreated_shm5_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm5_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() threads[1].start() @@ -1751,8 +2034,9 @@ def test_backlog_fill_no_end(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -1761,38 +2045,46 @@ def test_backlog_fill_no_end(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0) + (1, 2, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 13), dtype, 1) + (11, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 113), dtype, 2) + (111, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3) + (1111, 1112, 1113), dtype, 3 + ) precreated_shm4_handles = self.precreate_register_regions( - (11111,), dtype, 4) + (11111,), dtype, 4 + ) precreated_shm5_handles = self.precreate_register_regions( - (22222, 22223, 22224), dtype, 5) + (22222, 22223, 22224), dtype, 5 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
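The backlog-fill hunks above also contain sequences made of a single request carrying both control flags; the step list is a one-element tuple, and the reformat simply moves the literal onto one line. A tiny sketch of how such a step reads (values copied from the hunk, interpreted per the "(flag_str, value, pre_delay_ms)" comment):

# One-element step tuple: the trailing comma is what makes it a tuple.
single_step_sequence = (("start,end", 11111, None),)

assert isinstance(single_step_sequence, tuple) and len(single_step_sequence) == 1

flag_str, value, pre_delay_ms = single_step_sequence[0]
assert flag_str == "start,end"
assert value == 11111
assert pre_delay_ms is None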
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 10) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 10 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 3) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 3 + ) threads = [] - expected_result = self.get_expected_result( - 6, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(6, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1803,18 +2095,20 @@ def test_backlog_fill_no_end(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 24, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 24, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(24, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 24, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1827,15 +2121,18 @@ def test_backlog_fill_no_end(self): # (flag_str, value, pre_delay_ms) (("start", 11, None), ("end", 13, None)), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 224, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 224, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(224, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 224, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1848,15 +2145,18 @@ def test_backlog_fill_no_end(self): # (flag_str, value, pre_delay_ms) (("start", 111, None), ("end", 113, None)), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 3336, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336, 1113, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(3336, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336, 1113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1867,18 +2167,24 @@ def test_backlog_fill_no_end(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 
1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 11111, 11111, trial, "start,end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 11111, 11111, trial, "end", dtype) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(11111, 11111, trial, "start,end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 11111, 11111, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1889,18 +2195,20 @@ def test_backlog_fill_no_end(self): 1005, (None, None), # (flag_str, value, pre_delay_ms) - ( - ("start,end", 11111, None),), + (("start,end", 11111, None),), expected_result, - precreated_shm4_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 66669, 22224, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 66669, 22224, trial, "end", dtype) + precreated_shm4_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(66669, 22224, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 66669, 22224, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1917,11 +2225,11 @@ def test_backlog_fill_no_end(self): ("end", 22224, 2000), ), expected_result, - precreated_shm5_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm5_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() time.sleep(2) @@ -1967,8 +2275,9 @@ def test_backlog_same_correlation_id(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -1977,36 +2286,43 @@ def test_backlog_same_correlation_id(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0) + (1, 2, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13), dtype, 1) + (11, 12, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 113), dtype, 2) + (111, 112, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3) + (1111, 1112, 1113), dtype, 3 + ) precreated_shm4_handles = self.precreate_register_regions( - (11111, 11113), dtype, 4) + (11111, 11113), dtype, 4 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
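The check_status call near the end of test_backlog_same_correlation_id, reformatted in the hunks that follow, encodes the expected batching arithmetic. A small worked example of those counts (the per-bucket interpretation in the comments is an inference from the test structure, not stated in the patch):

for MODEL_INSTANCES in (1, 2, 4):
    if MODEL_INSTANCES != 4:
        batch_exec = {
            (4 / MODEL_INSTANCES): (3 * MODEL_INSTANCES),  # batched sequence requests
            1: 2,  # presumably the two backlog requests, executed one at a time
        }
    else:
        batch_exec = {1: (3 * MODEL_INSTANCES) + 2}
    exec_cnt = (3 * MODEL_INSTANCES) + 2
    infer_cnt = 4 * 3 + 2  # four 3-step sequences plus the 2-step backlog sequence = 14
    print(MODEL_INSTANCES, batch_exec, exec_cnt, infer_cnt)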
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 2) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 2 + ) threads = [] - expected_result = self.get_expected_result( - 6, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(6, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2017,18 +2333,20 @@ def test_backlog_same_correlation_id(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 36, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 36, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(36, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 36, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2039,18 +2357,24 @@ def test_backlog_same_correlation_id(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - ("end", 13, None)), + ( + ("start", 11, None), + (None, 12, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 336, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 336, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(336, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 336, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2061,18 +2385,24 @@ def test_backlog_same_correlation_id(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, None), (None, 112, None), - ("end", 113, None)), + ( + ("start", 111, None), + (None, 112, None), + ("end", 113, None), + ), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 3336, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336, 1113, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(3336, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336, 1113, trial, 
"end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2083,18 +2413,24 @@ def test_backlog_same_correlation_id(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 22224, 11113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 22224, 11113, trial, "end", dtype) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(22224, 11113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 22224, 11113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2107,11 +2443,11 @@ def test_backlog_same_correlation_id(self): # (flag_str, value, pre_delay_ms) (("start", 11111, None), ("end", 11113, None)), expected_result, - precreated_shm4_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm4_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() threads[1].start() @@ -2129,12 +2465,13 @@ def test_backlog_same_correlation_id(self): if MODEL_INSTANCES != 4: batch_exec = { (4 / MODEL_INSTANCES): (3 * MODEL_INSTANCES), - 1: 2 + 1: 2, } else: batch_exec = {1: (3 * MODEL_INSTANCES) + 2} - self.check_status(model_name, batch_exec, - (3 * MODEL_INSTANCES) + 2, 14) + self.check_status( + model_name, batch_exec, (3 * MODEL_INSTANCES) + 2, 14 + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) finally: @@ -2166,8 +2503,9 @@ def test_backlog_same_correlation_id_no_end(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -2176,35 +2514,40 @@ def test_backlog_same_correlation_id_no_end(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 3), dtype, 0) + (1, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 12, 13), dtype, 1) + (11, 12, 12, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 112, 113), dtype, 2) + (111, 112, 112, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1112, 1113), dtype, 3) + (1111, 1112, 1112, 1113), dtype, 3 + ) precreated_shm4_handles = self.precreate_register_regions( - (11111, 11113), dtype, 4) + (11111, 11113), dtype, 4 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 16) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 16 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 4, 3, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 4, 3, trial, None, dtype) + expected_result = ( + self.get_expected_result(4, 3, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit(4, 3, trial, None, dtype) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2217,15 +2560,18 @@ def test_backlog_same_correlation_id_no_end(self): # (flag_str, value, pre_delay_ms) (("start", 1, None), (None, 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 48, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 48, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(48, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 48, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2236,18 +2582,25 @@ def test_backlog_same_correlation_id_no_end(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - (None, 12, None), ("end", 13, None)), + ( + ("start", 11, None), + (None, 12, None), + (None, 12, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 448, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 448, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(448, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 448, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2258,18 +2611,25 @@ def test_backlog_same_correlation_id_no_end(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, None), (None, 112, None), - (None, 112, None), ("end", 113, None)), + ( + ("start", 111, None), + (None, 112, None), + (None, 112, None), + ("end", 113, None), + ), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 4448, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 4448, 1113, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(4448, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 4448, 1113, trial, "end", dtype + ) 
+ ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2280,18 +2640,25 @@ def test_backlog_same_correlation_id_no_end(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - (None, 1112, None), ("end", 1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 22224, 11113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 22224, 11113, trial, "end", dtype) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(22224, 11113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 22224, 11113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2304,11 +2671,11 @@ def test_backlog_same_correlation_id_no_end(self): # (flag_str, value, pre_delay_ms) (("start", 11111, None), ("end", 11113, None)), expected_result, - precreated_shm4_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm4_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() threads[1].start() @@ -2355,8 +2722,9 @@ def test_backlog_sequence_timeout(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -2365,35 +2733,38 @@ def test_backlog_sequence_timeout(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 3), dtype, 0) + (1, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 12, 13), dtype, 1) + (11, 12, 12, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 112, 113), dtype, 2) + (111, 112, 112, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1112, 1113), dtype, 3) + (1111, 1112, 1112, 1113), dtype, 3 + ) precreated_shm4_handles = self.precreate_register_regions( - (11111, 11113), dtype, 4) + (11111, 11113), dtype, 4 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for all sequences. 
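In test_backlog_sequence_timeout, whose hunks follow, the step tuples carry non-trivial pre-delays so that the first sequence goes idle past the scheduler's limit while the others stay just under it. A small interpretive sketch of one such step list (the idle-limit value below is an assumption for illustration only; the real constant is defined elsewhere in the file):

_max_sequence_idle_ms = 5000  # assumed value for illustration only

# Sequence 1001: the second request deliberately sleeps past the idle limit,
# so the scheduler should have timed the sequence out before it arrives.
steps = (
    ("start", 1, None),
    (None, 3, _max_sequence_idle_ms + 1000),
)

for flag_str, value, pre_delay_ms in steps:
    delay = 0 if pre_delay_ms is None else pre_delay_ms
    print("flags={}, value={}, sleep {} ms before sending".format(flag_str, value, delay))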
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 4) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 4) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 4, 3, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 4, 3, trial, None, dtype) + expected_result = ( + self.get_expected_result(4, 3, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit(4, 3, trial, None, dtype) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2404,18 +2775,23 @@ def test_backlog_sequence_timeout(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), - (None, 3, _max_sequence_idle_ms + 1000)), + ( + ("start", 1, None), + (None, 3, _max_sequence_idle_ms + 1000), + ), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 48, 13, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 48, 13, trial, None, dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(48, 13, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 48, 13, trial, None, dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2426,20 +2802,25 @@ def test_backlog_sequence_timeout(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, - None), (None, 12, _max_sequence_idle_ms / 2), - (None, 12, _max_sequence_idle_ms / 2), - ("end", 13, _max_sequence_idle_ms / 2)), + ( + ("start", 11, None), + (None, 12, _max_sequence_idle_ms / 2), + (None, 12, _max_sequence_idle_ms / 2), + ("end", 13, _max_sequence_idle_ms / 2), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 448, 113, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 448, 113, trial, None, dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(448, 113, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 448, 113, trial, None, dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2450,20 +2831,25 @@ def test_backlog_sequence_timeout(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, - None), (None, 112, _max_sequence_idle_ms / 2), - (None, 112, _max_sequence_idle_ms / 2), - ("end", 113, _max_sequence_idle_ms / 2)), + ( + ("start", 111, None), + (None, 112, _max_sequence_idle_ms / 2), + (None, 112, _max_sequence_idle_ms / 2), + ("end", 113, _max_sequence_idle_ms / 2), + ), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 4448, 1113, trial, None - ) if not IMPLICIT_STATE 
else self.get_expected_result_implicit( - 4448, 1113, trial, None, dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(4448, 1113, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 4448, 1113, trial, None, dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2474,20 +2860,25 @@ def test_backlog_sequence_timeout(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), - (None, 1112, _max_sequence_idle_ms / 2), - (None, 1112, _max_sequence_idle_ms / 2), - ("end", 1113, _max_sequence_idle_ms / 2)), + ( + ("start", 1111, None), + (None, 1112, _max_sequence_idle_ms / 2), + (None, 1112, _max_sequence_idle_ms / 2), + ("end", 1113, _max_sequence_idle_ms / 2), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 22224, 11113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 22224, 11113, trial, "end", dtype) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(22224, 11113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 22224, 11113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2500,11 +2891,11 @@ def test_backlog_sequence_timeout(self): # (flag_str, value, pre_delay_ms) (("start", 11111, None), ("end", 11113, None)), expected_result, - precreated_shm4_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm4_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() threads[1].start() @@ -2520,18 +2911,27 @@ def test_backlog_sequence_timeout(self): except Exception as ex: for prefix in ENSEMBLE_PREFIXES: if model_name.startswith(prefix): - base_model_name = model_name[(len(prefix)):] - self.assertTrue(ex.message().startswith( - str("in ensemble '{}', " + - "inference request for sequence 1001 to " + - "model '{}' must specify the START flag on the first " - + "request of the sequence").format( - model_name, base_model_name))) + base_model_name = model_name[(len(prefix)) :] + self.assertTrue( + ex.message().startswith( + str( + "in ensemble '{}', " + + "inference request for sequence 1001 to " + + "model '{}' must specify the START flag on the first " + + "request of the sequence" + ).format(model_name, base_model_name) + ) + ) return - self.assertTrue(ex.message().startswith( - str("inference request for sequence 1001 to " + - "model '{}' must specify the START flag on the first " - + "request of the sequence").format(model_name))) + self.assertTrue( + ex.message().startswith( + str( + "inference request for sequence 1001 to " + + "model '{}' must specify the START flag on the first " + + "request of the sequence" + ).format(model_name) + ) + ) finally: if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: self.cleanup_shm_regions(precreated_shm0_handles) @@ -2567,28 +2967,30 @@ def test_queue_delay_no_min_util(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1,), dtype, 0) + (1,), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12), dtype, 1) + (11, 12), dtype, 1 + ) try: 
self.check_setup(model_name) # Need scheduler to wait for queue to contain 2 sequences. - self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 1, 1, trial, "start" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 1, 1, trial, "start", dtype) + expected_result = ( + self.get_expected_result(1, 1, trial, "start") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 1, 1, trial, "start", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2599,18 +3001,20 @@ def test_queue_delay_no_min_util(self): 1001, (2000, None), # (flag_str, value, pre_delay_ms) - ( - ("start", 1, None),), + (("start", 1, None),), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 23, 12, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 23, 12, trial, None, dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(23, 12, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 23, 12, trial, None, dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2626,11 +3030,11 @@ def test_queue_delay_no_min_util(self): (None, 12, None), ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() time.sleep(1) @@ -2674,28 +3078,30 @@ def test_queue_delay_half_min_util(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1,), dtype, 0) + (1,), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12), dtype, 1) + (11, 12), dtype, 1 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain 2 sequences. 
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 1, 1, trial, "start" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 1, 1, trial, "start", dtype) + expected_result = ( + self.get_expected_result(1, 1, trial, "start") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 1, 1, trial, "start", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2706,18 +3112,20 @@ def test_queue_delay_half_min_util(self): 1001, (2000, None), # (flag_str, value, pre_delay_ms) - ( - ("start", 1, None),), + (("start", 1, None),), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 23, 12, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 23, 12, trial, None, dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(23, 12, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 23, 12, trial, None, dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2733,11 +3141,11 @@ def test_queue_delay_half_min_util(self): (None, 12, None), ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() time.sleep(1) @@ -2781,28 +3189,30 @@ def test_queue_delay_full_min_util(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1,), dtype, 0) + (1,), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12), dtype, 1) + (11, 12), dtype, 1 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain 2 sequences. 
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 1, 1, trial, "start" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 1, 1, trial, "start", dtype) + expected_result = ( + self.get_expected_result(1, 1, trial, "start") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 1, 1, trial, "start", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2813,18 +3223,20 @@ def test_queue_delay_full_min_util(self): 1001, (4000, 3000), # (flag_str, value, pre_delay_ms) - ( - ("start", 1, None),), + (("start", 1, None),), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 23, 12, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 23, 12, trial, None, dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(23, 12, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 23, 12, trial, None, dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2840,11 +3252,11 @@ def test_queue_delay_full_min_util(self): (None, 12, 2000), ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() time.sleep(1) @@ -2862,5 +3274,345 @@ def test_queue_delay_full_min_util(self): self.cleanup_shm_regions(precreated_shm1_handles) -if __name__ == '__main__': +class SequenceBatcherRequestTimeoutTest(su.SequenceBatcherTestUtil): + def setUp(self): + super(SequenceBatcherRequestTimeoutTest, self).setUp() + # By default, find tritonserver on "localhost", but can be overridden + # with TRITONSERVER_IPADDR envvar + self.server_address_ = ( + os.environ.get("TRITONSERVER_IPADDR", "localhost") + ":8001" + ) + + # Prepare input and expected output based on the model and + # the infer sequence sent for testing. 
If the test is to be extended + # for different sequence and model, then proper grouping should be added + self.model_name_ = "custom_sequence_int32_timeout" + self.tensor_data_ = np.ones(shape=[1, 1], dtype=np.int32) + self.inputs_ = [grpcclient.InferInput("INPUT0", [1, 1], "INT32")] + self.inputs_[0].set_data_from_numpy(self.tensor_data_) + self.expected_out_seq_ = [ + ("OUTPUT0", self.tensor_data_), + ("OUTPUT0", self.tensor_data_), + ("OUTPUT0", self.tensor_data_), + ] + + def send_sequence_with_timeout( + self, seq_id, callback, timeout_us=3000000, request_pause_sec=0 + ): + with grpcclient.InferenceServerClient(self.server_address_) as triton_client: + triton_client.start_stream(callback=callback) + triton_client.async_stream_infer( + self.model_name_, + self.inputs_, + sequence_id=seq_id, + sequence_start=True, + timeout=timeout_us, + ) + if request_pause_sec != 0: + time.sleep(request_pause_sec) + triton_client.async_stream_infer( + self.model_name_, self.inputs_, sequence_id=seq_id, timeout=timeout_us + ) + if request_pause_sec != 0: + time.sleep(request_pause_sec) + triton_client.async_stream_infer( + self.model_name_, + self.inputs_, + sequence_id=seq_id, + sequence_end=True, + timeout=timeout_us, + ) + + def test_request_timeout(self): + # Test long running model that receives requests with shorter timeout, + # expect the timeout will only be expired on backlog sequence and reject + # all requests of the sequence once expired. + # Sending two sequences while the model can only process one sequence + # at a time. Each model execution takes 5 second and all requests have + # 3 second timeout, so the second sequence will be rejected. + + # correlation ID is 1-index + seq1_res = [] + seq2_res = [] + seq1_callback = lambda result, error: seq1_res.append((result, error)) + seq2_callback = lambda result, error: seq2_res.append((result, error)) + + # send sequence with 1s interval to ensure processing order + threads = [] + threads.append( + threading.Thread( + target=self.send_sequence_with_timeout, args=(1, seq1_callback) + ) + ) + threads.append( + threading.Thread( + target=self.send_sequence_with_timeout, args=(2, seq2_callback) + ) + ) + threads[0].start() + time.sleep(1) + threads[1].start() + for t in threads: + t.join() + + for idx in range(len(seq1_res)): + result, error = seq1_res[idx] + self.assertIsNone( + error, + "Expect successful inference for sequence 1 requests, got error: {}".format( + error + ), + ) + out = result.as_numpy(self.expected_out_seq_[idx][0]) + expected_out = self.expected_out_seq_[idx][1] + np.testing.assert_allclose( + out, + expected_out, + err_msg="Unexpected output tensor: expect {}, got {}".format( + expected_out, out + ), + ) + + for _, error in seq2_res: + self.assertIsNotNone(error, "Expect error for sequence 2 requests") + with self.assertRaisesRegex( + InferenceServerException, + "timeout of the corresponding sequence has been expired", + msg="Unexpected error: {}".format(error), + ): + raise error + + def test_send_request_after_timeout(self): + # Similar to test_request_timeout, but the sequence to be timed out + # will send the last request after the sequence has been timed out, + # and expecting server to return error regarding sending request of + # an untracked sequence + + seq1_res = [] + seq2_res = [] + seq1_callback = lambda result, error: seq1_res.append((result, error)) + seq2_callback = lambda result, error: seq2_res.append((result, error)) + + threads = [] + threads.append( + threading.Thread( + 
target=self.send_sequence_with_timeout, args=(1, seq1_callback) + ) + ) + # Each request will be sent with a pause, so the third request + # will be sent after the sequence has been timed out + threads.append( + threading.Thread( + target=self.send_sequence_with_timeout, + args=(2, seq2_callback), + kwargs={"request_pause_sec": 2}, + ) + ) + threads[0].start() + time.sleep(1) + threads[1].start() + for t in threads: + t.join() + + # Check error message of the last request and the rest + # separately + for _, error in seq2_res[0:-1]: + self.assertIsNotNone(error, "Expect error for sequence 2 requests") + with self.assertRaisesRegex( + InferenceServerException, + "timeout of the corresponding sequence has been expired", + msg="Unexpected error: {}".format(error), + ): + raise error + _, last_err = seq2_res[-1] + self.assertIsNotNone(last_err, "Expect error for sequence 2 requests") + with self.assertRaisesRegex( + InferenceServerException, + "must specify the START flag on the first request", + msg="Unexpected error: {}".format(last_err), + ): + raise last_err + + +class SequenceBatcherPreserveOrderingTest(su.SequenceBatcherTestUtil): + def setUp(self): + super().setUp() + # By default, find tritonserver on "localhost", but can be overridden + # with TRITONSERVER_IPADDR envvar + self.server_address_ = ( + os.environ.get("TRITONSERVER_IPADDR", "localhost") + ":8001" + ) + + # Prepare input and expected output based on the model and + # the infer sequence sent for testing. If the test is to be extended + # for different sequence and model, then proper grouping should be added + self.model_name_ = "sequence_py" + self.tensor_data_ = np.ones(shape=[1, 1], dtype=np.int32) + self.inputs_ = [grpcclient.InferInput("INPUT0", [1, 1], "INT32")] + self.inputs_[0].set_data_from_numpy(self.tensor_data_) + self.triton_client = grpcclient.InferenceServerClient(self.server_address_) + + # Atomic request ID for multi-threaded inference + self.request_id_lock = threading.Lock() + self.request_id = 1 + + def send_sequence(self, seq_id, seq_id_map, req_id_map): + if seq_id not in seq_id_map: + seq_id_map[seq_id] = [] + + start, middle, end = (True, False), (False, False), (False, True) + # Send sequence with 1 start, 1 middle, and 1 end request + seq_flags = [start, middle, end] + for start_flag, end_flag in seq_flags: + # Introduce random sleep to better interweave requests from different sequences + time.sleep(random.uniform(0.0, 1.0)) + + # Serialize sending requests to ensure ordered request IDs + with self.request_id_lock: + req_id = self.request_id + self.request_id += 1 + + # Store metadata to validate results later + req_id_map[req_id] = seq_id + seq_id_map[seq_id].append(req_id) + + self.triton_client.async_stream_infer( + self.model_name_, + self.inputs_, + sequence_id=seq_id, + sequence_start=start_flag, + sequence_end=end_flag, + timeout=None, + request_id=str(req_id), + ) + + def _test_sequence_ordering(self, preserve_ordering, decoupled): + # 1. Send a few grpc streaming sequence requests to the model. + # 2. With grpc streaming, the model should receive the requests in + # the same order they are sent from client, and the client should + # receive the responses in the same order sent back by the + # model/server. With sequence scheduler, the requests for each sequence should be routed to the same model + # instance, and no two requests from the same sequence should + # get batched together. + # 3. 
With preserve_ordering=False, we may get the responses back in a different + # order than the requests, but with grpc streaming we should still expect responses for each sequence to be ordered. + # 4. Assert that the sequence values are ordered, and that the response IDs per sequence are ordered + class SequenceResult: + def __init__(self, seq_id, result, request_id): + self.seq_id = seq_id + self.result = result + self.request_id = int(request_id) + + def full_callback(sequence_dict, sequence_list, result, error): + # We expect no model errors for this test + if error: + self.assertTrue(False, error) + + # Gather all the necessary metadata for validation + request_id = int(result.get_response().id) + sequence_id = request_id_map[request_id] + # Overall list of results in the order received, regardless of sequence ID + sequence_list.append(SequenceResult(sequence_id, result, request_id)) + # Ordered results organized by their seq IDs + sequence_dict[sequence_id].append(result) + + # Store ordered list in which responses are received by client + sequence_list = [] + # Store mapping of sequence ID to response results + sequence_dict = {} + # Store mapping of sequence ID to request IDs and vice versa + sequence_id_map = {} + request_id_map = {} + + # Start stream + seq_callback = partial(full_callback, sequence_dict, sequence_list) + self.triton_client.start_stream(callback=seq_callback) + + # Send N sequences concurrently + threads = [] + num_sequences = 10 + for i in range(num_sequences): + # Sequence IDs are 1-indexed + sequence_id = i + 1 + # Add a result list and callback for each sequence + sequence_dict[sequence_id] = [] + threads.append( + threading.Thread( + target=self.send_sequence, + args=(sequence_id, sequence_id_map, request_id_map), + ) + ) + + # Start all sequence threads + for t in threads: + t.start() + + # Wait for threads to return + for t in threads: + t.join() + + # Block until all requests are completed + self.triton_client.stop_stream() + + # Make sure some inferences occurred and metadata was collected + self.assertGreater(len(sequence_dict), 0) + self.assertGreater(len(sequence_list), 0) + + # Validate model results are sorted per sequence ID (model specific logic) + print(f"=== {preserve_ordering=} {decoupled=} ===") + print("Outputs per Sequence:") + for seq_id, sequence in sequence_dict.items(): + seq_outputs = [ + result.as_numpy("OUTPUT0").flatten().tolist() for result in sequence + ] + print(f"{seq_id}: {seq_outputs}") + self.assertEqual(seq_outputs, sorted(seq_outputs)) + + # Validate request/response IDs for each response in a sequence is sorted + # This should be true regardless of preserve_ordering or not + print("Request IDs per Sequence:") + for seq_id in sequence_id_map: + per_seq_request_ids = sequence_id_map[seq_id] + print(f"{seq_id}: {per_seq_request_ids}") + self.assertEqual(per_seq_request_ids, sorted(per_seq_request_ids)) + + # Validate results are sorted in request order if preserve_ordering is True + if preserve_ordering: + request_ids = [s.request_id for s in sequence_list] + print(f"Request IDs overall:\n{request_ids}") + sequence_ids = [s.seq_id for s in sequence_list] + print(f"Sequence IDs overall:\n{sequence_ids}") + self.assertEqual(request_ids, sorted(request_ids)) + + # Assert some dynamic batching of requests was done + stats = self.triton_client.get_inference_statistics( + model_name=self.model_name_, headers={}, as_json=True + ) + model_stats = stats["model_stats"][0] + self.assertEqual(model_stats["name"], self.model_name_) + 
self.assertLess( + int(model_stats["execution_count"]), int(model_stats["inference_count"]) + ) + + def test_sequence_with_preserve_ordering(self): + self.model_name_ = "seqpy_preserve_ordering_nondecoupled" + self._test_sequence_ordering(preserve_ordering=True, decoupled=False) + + def test_sequence_without_preserve_ordering(self): + self.model_name_ = "seqpy_no_preserve_ordering_nondecoupled" + self._test_sequence_ordering(preserve_ordering=False, decoupled=False) + + # FIXME [DLIS-5280]: This may fail for decoupled models if writes to GRPC + # stream are done out of order in server, so disable test for now. + # def test_sequence_with_preserve_ordering_decoupled(self): + # self.model_name_ = "seqpy_preserve_ordering_decoupled" + # self._test_sequence_ordering(preserve_ordering=True, decoupled=True) + + # FIXME [DLIS-5280] + # def test_sequence_without_preserve_ordering_decoupled(self): + # self.model_name_ = "seqpy_no_preserve_ordering_decoupled" + # self._test_sequence_ordering(preserve_ordering=False, decoupled=True) + + +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_sequence_batcher/test.sh b/qa/L0_sequence_batcher/test.sh index a201dcf7a3..d91b433966 100755 --- a/qa/L0_sequence_batcher/test.sh +++ b/qa/L0_sequence_batcher/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -42,11 +42,21 @@ TEST_RESULT_FILE='test_results.txt' # Must run on a single device or else the TRITONSERVER_DELAY_SCHEDULER # can fail when the requests are distributed to multiple devices. +ldconfig || true + export CUDA_VISIBLE_DEVICES=0 CLIENT_LOG="./client.log" BATCHER_TEST=sequence_batcher_test.py +if [ -z "$TEST_SYSTEM_SHARED_MEMORY" ]; then + TEST_SYSTEM_SHARED_MEMORY="0" +fi + +if [ -z "$TEST_CUDA_SHARED_MEMORY" ]; then + TEST_CUDA_SHARED_MEMORY="0" +fi + if [ -z "$TEST_VALGRIND" ]; then TEST_VALGRIND="0" fi @@ -77,33 +87,43 @@ if [ "$TEST_JETSON" -eq 1 ]; then MODEL_TRIALS="0 v" fi -TF_VERSION=${TF_VERSION:=1} +TF_VERSION=${TF_VERSION:=2} # On windows the paths invoked by the script (running in WSL) must use # /mnt/c when needed but the paths on the tritonserver command-line # must be C:/ style. +WINDOWS=0 if [[ "$(< /proc/sys/kernel/osrelease)" == *microsoft* ]]; then MODELDIR=${MODELDIR:=C:/models} DATADIR=${DATADIR:="/mnt/c/data/inferenceserver/${REPO_VERSION}"} BACKEND_DIR=${BACKEND_DIR:=C:/tritonserver/backends} SERVER=${SERVER:=/mnt/c/tritonserver/bin/tritonserver.exe} export WSLENV=$WSLENV:TRITONSERVER_DELAY_SCHEDULER:TRITONSERVER_BACKLOG_DELAY_SCHEDULER + WINDOWS=1 else MODELDIR=${MODELDIR:=`pwd`} DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"} TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends + + # PyTorch on SBSA requires libgomp to be loaded first. 
See the following + # GitHub issue for more information: + # https://github.com/pytorch/pytorch/issues/2575 + arch=`uname -m` + if [ $arch = "aarch64" ]; then + SERVER_LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libgomp.so.1 + fi fi -SERVER_ARGS_EXTRA="--backend-directory=${BACKEND_DIR} --backend-config=tensorflow,version=${TF_VERSION}" +SERVER_ARGS_EXTRA="--backend-directory=${BACKEND_DIR} --backend-config=tensorflow,version=${TF_VERSION} --log-verbose=1" source ../common/util.sh RET=0 # If BACKENDS not specified, set to all -BACKENDS=${BACKENDS:="graphdef savedmodel onnx plan libtorch custom"} +BACKENDS=${BACKENDS:="graphdef savedmodel onnx plan libtorch custom python"} export BACKENDS # If MODEL_TRIALS not specified set to 0 1 2 4 v @@ -151,13 +171,17 @@ export INITIAL_STATE_FILE INITIAL_STATE_ZERO=${INITIAL_STATE_ZERO:="0"} export INITIAL_STATE_ZERO +# If USE_SINGLE_BUFFER is not specified, set to 0 +USE_SINGLE_BUFFER=${USE_SINGLE_BUFFER:="0"} +export USE_SINGLE_BUFFER + # Setup non-variable-size model repositories. The same models are in each # repository but they are configured as: # models0 - four instances with non-batching model # models1 - one instance with batch-size 4 # models2 - two instances with batch-size 2 # models4 - four instances with batch-size 1 -rm -fr *.log *.serverlog models{0,1,2,4} queue_delay_models && mkdir models{0,1,2,4} queue_delay_models +rm -fr *.log models{0,1,2,4} queue_delay_models && mkdir models{0,1,2,4} queue_delay_models # Get the datatype to use based on the backend function get_datatype () { @@ -175,10 +199,29 @@ function get_datatype () { if [[ $1 == "onnx" ]]; then dtype="object int32 bool" fi + if [[ $1 == "libtorch" ]]; then + dtype="object int32 bool" + fi fi echo $dtype } +# Modify corresponding onnx config.pbtxt to create python config.pbtxt +function generate_python_models () { + model_path=$1 + dest_dir=$2 + onnx_model=$(echo ${model_path//python/onnx}) + python_model=$(basename $model_path) + mkdir -p $dest_dir/$python_model/1/ + # for emsemble models keep "platform: ensemble" + if [[ "$model_path" == *"ensemble_model"* ]]; then + cat $onnx_model/config.pbtxt | sed 's/onnx/python/g' > $dest_dir/$python_model/config.pbtxt + else + cat $onnx_model/config.pbtxt | sed 's/platform:.*/backend:\ "python"/g' | sed 's/onnx/python/g' > $dest_dir/$python_model/config.pbtxt + cp ../python_models/sequence_int32/model.py $dest_dir/$python_model/1/ + fi +} + if [[ "$INITIAL_STATE_ZERO" == "1" && "$INITIAL_STATE_FILE" == "1" ]]; then echo -e "\n***\n*** 'INITIAL_STATE_ZERO' and 'INITIAL_STATE_FILE' can't be enabled simultaneously. \n***" exit 1 @@ -200,6 +243,7 @@ else fi MODELS="" +PYTHON_MODELS="" for BACKEND in $BACKENDS; do if [[ $BACKEND == "custom" ]]; then MODELS="$MODELS ../custom_models/custom_sequence_int32" @@ -214,7 +258,13 @@ for BACKEND in $BACKENDS; do for DTYPE in $DTYPES; do # We don't generate ensemble models for bool data type. if [[ $DTYPE != "bool" ]]; then - MODELS="$MODELS $DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_${BACKEND}_sequence_${DTYPE}" + if [ "$BACKEND" == "python" ]; then + PYTHON_MODELS="$DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_onnx_sequence_${DTYPE}" + TMP=$(echo $PYTHON_MODELS) + MODELS="$MODELS ${TMP//onnx/python}" + else + MODELS="$MODELS $DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_${BACKEND}_sequence_${DTYPE}" + fi fi done fi @@ -229,28 +279,57 @@ fi for MODEL in $MODELS; do if [[ ! "$TEST_VALGRIND" -eq 1 ]]; then - cp -r $MODEL models1/. 
&& \ + # Skip libtorch string models + if [[ "$MODEL" =~ .*"libtorch".*"object".* ]]; then + continue + fi + if [[ "$MODEL" =~ .*"python".* ]]; then + generate_python_models "$MODEL" "models1" + else + cp -r $MODEL models1/. + fi (cd models1/$(basename $MODEL) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 4/" config.pbtxt && \ sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 1/" config.pbtxt && \ sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" config.pbtxt) - cp -r $MODEL models2/. && \ + + # Skip libtorch string models + if [[ "$MODEL" =~ .*"libtorch".*"object".* ]]; then + continue + fi + + if [[ "$MODEL" =~ .*"python".* ]]; then + generate_python_models "$MODEL" "models2" + else + cp -r $MODEL models2/. + fi (cd models2/$(basename $MODEL) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 2/" config.pbtxt && \ sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 2/" config.pbtxt && \ sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 2/" config.pbtxt) - cp -r $MODEL models4/. && \ + + if [[ "$MODEL" =~ .*"python".* ]]; then + generate_python_models "$MODEL" "models4" + else + cp -r $MODEL models4/. + fi (cd models4/$(basename $MODEL) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 1/" config.pbtxt && \ sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 4/" config.pbtxt && \ sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 4/" config.pbtxt) + # Duplicate the models for different delay settings - cp -r $MODEL queue_delay_models/. && \ + if [[ "$MODEL" =~ .*"python".* ]]; then + generate_python_models "$MODEL" "queue_delay_models" + else + cp -r $MODEL queue_delay_models/. + fi (cd queue_delay_models/$(basename $MODEL) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 4/" config.pbtxt && \ sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 1/" config.pbtxt && \ sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" config.pbtxt && \ sed -i "s/sequence_batching {/sequence_batching {\\ndirect {\\nmax_queue_delay_microseconds: 3000000\\nminimum_slot_utilization: 0\\n}/" config.pbtxt) + cp -r queue_delay_models/$(basename $MODEL) queue_delay_models/$(basename $MODEL)_half && \ (cd queue_delay_models/$(basename $MODEL)_half && \ sed -i "s/$(basename $MODEL)/$(basename $MODEL)_half/" config.pbtxt && \ @@ -259,6 +338,23 @@ for MODEL in $MODELS; do (cd queue_delay_models/$(basename $MODEL)_full && \ sed -i "s/$(basename $MODEL)/$(basename $MODEL)_full/" config.pbtxt && \ sed -i "s/minimum_slot_utilization: 0/minimum_slot_utilization: 1/" config.pbtxt) + + # TODO: Enable single state buffer testing for sequence batcher + # if [ "$USE_SINGLE_BUFFER" == "1" && "$IMPLICIT_STATE" == "1" ]; then + # SED_REPLACE_PATTERN="N;N;N;N;N;/state.*dims:.*/a use_single_buffer: true" + # (cd models0/$(basename $MODEL) && \ + # sed -i "$SED_REPLACE_PATTERN" config.pbtxt) + # (cd models1/$(basename $MODEL) && \ + # sed -i "$SED_REPLACE_PATTERN" config.pbtxt) + # (cd models2/$(basename $MODEL) && \ + # sed -i "$SED_REPLACE_PATTERN" config.pbtxt) + # (cd models4/$(basename $MODEL) && \ + # sed -i "$SED_REPLACE_PATTERN" config.pbtxt) + # (cd queue_delay_models/$(basename $MODEL)_full && \ + # sed -i "$SED_REPLACE_PATTERN" config.pbtxt) + # (cd queue_delay_models/$(basename $MODEL)_half && \ + # sed -i "$SED_REPLACE_PATTERN" config.pbtxt) + # fi else cp -r $MODEL queue_delay_models/$(basename $MODEL)_full && \ (cd queue_delay_models/$(basename $MODEL)_full && \ @@ -307,6 +403,7 @@ if [ "$INITIAL_STATE_FILE" == "1" ]; then fi MODELS="" +PYTHON_MODELS="" for BACKEND in $BACKENDS; do if [[ $BACKEND == "custom" ]]; then 
MODELS="$MODELS ../custom_models/custom_sequence_int32" @@ -320,7 +417,13 @@ for BACKEND in $BACKENDS; do for DTYPE in $DTYPES; do # We don't generate ensemble models for bool data type. if [[ $DTYPE != "bool" ]]; then - MODELS="$MODELS $DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_${BACKEND}_nobatch_sequence_${DTYPE}" + if [ "$BACKEND" == "python" ]; then + PYTHON_MODELS="$DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_onnx_nobatch_sequence_${DTYPE}" + TMP=$(echo $PYTHON_MODELS) + MODELS="$MODELS ${TMP//onnx/python}" + else + MODELS="$MODELS $DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_${BACKEND}_nobatch_sequence_${DTYPE}" + fi fi done @@ -329,22 +432,27 @@ for BACKEND in $BACKENDS; do done for MODEL in $MODELS; do - cp -r $MODEL models0/. && \ - (cd models0/$(basename $MODEL) && \ - sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 4/" config.pbtxt && \ - sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 4/" config.pbtxt) - - if [ "$INITIAL_STATE_FILE" == "1" ]; then - mkdir -p models0/$(basename $MODEL)/initial_state/ && cp input_state_data models0/$(basename $MODEL)/initial_state/ && \ - (cd models0/$(basename $MODEL) && \ - sed -i "s/zero_data.*/data_file:\"input_state_data\"/" config.pbtxt) - fi + if [[ "$MODEL" =~ .*"python".* ]]; then + generate_python_models "$MODEL" "models0" + else + cp -r $MODEL models0/. + fi + (cd models0/$(basename $MODEL) && \ + sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 4/" config.pbtxt && \ + sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 4/" config.pbtxt) + + if [ "$INITIAL_STATE_FILE" == "1" ]; then + mkdir -p models0/$(basename $MODEL)/initial_state/ && cp input_state_data models0/$(basename $MODEL)/initial_state/ && \ + (cd models0/$(basename $MODEL) && \ + sed -i "s/zero_data.*/data_file:\"input_state_data\"/" config.pbtxt) + fi done # modelsv - one instance with batch-size 4 rm -fr modelsv && mkdir modelsv MODELS="" +PYTHON_MODELS="" for BACKEND in $BACKENDS; do if [[ $BACKEND == "custom" ]]; then MODELS="$MODELS ../custom_models/custom_sequence_int32" @@ -358,7 +466,13 @@ for BACKEND in $BACKENDS; do for DTYPE in $DTYPES; do # We don't generate ensemble models for bool data type. if [[ $DTYPE != "bool" ]]; then - MODELS="$MODELS $DATADIR/qa_ensemble_model_repository/${VAR_MODEL_REPOSITORY}/*_${BACKEND}_sequence_${DTYPE}" + if [ "$BACKEND" == "python" ]; then + PYTHON_MODELS="$DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_onnx_sequence_${DTYPE}" + TMP=$(echo $PYTHON_MODELS) + MODELS="$MODELS ${TMP//onnx/python}" + else + MODELS="$MODELS $DATADIR/qa_ensemble_model_repository/${VAR_MODEL_REPOSITORY}/*_${BACKEND}_sequence_${DTYPE}" + fi fi done fi @@ -366,17 +480,25 @@ for BACKEND in $BACKENDS; do done for MODEL in $MODELS; do - cp -r $MODEL modelsv/. 
&& \ - (cd modelsv/$(basename $MODEL) && \ - sed -i "s/^max_batch_size:.*/max_batch_size: 4/" config.pbtxt && \ - sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 1/" config.pbtxt && \ - sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" config.pbtxt) - - if [ "$INITIAL_STATE_FILE" == "1" ]; then - mkdir -p modelsv/$(basename $MODEL)/initial_state/ && cp input_state_data modelsv/$(basename $MODEL)/initial_state/ && \ - (cd modelsv/$(basename $MODEL) && \ - sed -i "s/zero_data.*/data_file:\"input_state_data\"/" config.pbtxt) - fi + # Skip libtorch string models + if [[ "$MODEL" =~ .*"libtorch".*"object".* ]]; then + continue + fi + if [[ "$MODEL" =~ .*"python".* ]]; then + generate_python_models "$MODEL" "modelsv" + else + cp -r $MODEL modelsv/. + fi + (cd modelsv/$(basename $MODEL) && \ + sed -i "s/^max_batch_size:.*/max_batch_size: 4/" config.pbtxt && \ + sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 1/" config.pbtxt && \ + sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" config.pbtxt) + + if [ "$INITIAL_STATE_FILE" == "1" ]; then + mkdir -p modelsv/$(basename $MODEL)/initial_state/ && cp input_state_data modelsv/$(basename $MODEL)/initial_state/ && \ + (cd modelsv/$(basename $MODEL) && \ + sed -i "s/zero_data.*/data_file:\"input_state_data\"/" config.pbtxt) + fi done # Same test work on all models since they all have same total number @@ -408,7 +530,7 @@ for model_trial in $MODEL_TRIALS; do for i in $NO_DELAY_TESTS; do SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$MODEL_PATH.serverlog" + SERVER_LOG="./$i.$MODEL_PATH.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.$MODEL_PATH.valgrind.log" @@ -468,7 +590,7 @@ for model_trial in $MODEL_TRIALS; do [[ "$i" != "test_half_batch" ]] && export TRITONSERVER_DELAY_SCHEDULER=4 && [[ "$i" != "test_backlog_sequence_timeout" ]] && export TRITONSERVER_DELAY_SCHEDULER=12 SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$MODEL_PATH.serverlog" + SERVER_LOG="./$i.$MODEL_PATH.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.$MODEL_PATH.valgrind.log" @@ -538,7 +660,7 @@ if [[ $BACKENDS == *"custom"* ]]; then export TRITONSERVER_DELAY_SCHEDULER=12 SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$MODEL_PATH.serverlog" + SERVER_LOG="./$i.$MODEL_PATH.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.$MODEL_PATH.valgrind.log" @@ -596,7 +718,7 @@ for i in $QUEUE_DELAY_TESTS ; do export TRITONSERVER_BACKLOG_DELAY_SCHEDULER=0 export TRITONSERVER_DELAY_SCHEDULER=2 SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$MODEL_PATH.serverlog" + SERVER_LOG="./$i.$MODEL_PATH.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.$MODEL_PATH.valgrind.log" @@ -644,6 +766,144 @@ for i in $QUEUE_DELAY_TESTS ; do set -e done +# Test request timeout with sequence batcher +# only run the test outside shared memory setting as +# shared memory feature is irrelevant +if [ "$TEST_SYSTEM_SHARED_MEMORY" -ne 1 ] && [ "$TEST_CUDA_SHARED_MEMORY" -ne 1 ]; then + export NO_BATCHING=0 + export MODEL_INSTANCES=1 + export BATCHER_TYPE="FIXED" + + TEST_CASE=SequenceBatcherRequestTimeoutTest + MODEL_PATH=request_timeout_models + mkdir -p ${MODEL_PATH}/custom_sequence_int32_timeout/1 + + SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" + SERVER_LOG="./$TEST_CASE.$MODEL_PATH.server.log" + + if [ "$TEST_VALGRIND" 
-eq 1 ]; then + LEAKCHECK_LOG="./$i.$MODEL_PATH.valgrind.log" + LEAKCHECK_ARGS="$LEAKCHECK_ARGS_BASE --log-file=$LEAKCHECK_LOG" + run_server_leakcheck + else + run_server + fi + + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + echo "Test: $TEST_CASE, repository $MODEL_PATH" >>$CLIENT_LOG + + set +e + python3 $BATCHER_TEST $TEST_CASE >>$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Test $TEST_CASE Failed\n***" >>$CLIENT_LOG + echo -e "\n***\n*** Test $TEST_CASE Failed\n***" + RET=1 + else + check_test_results $TEST_RESULT_FILE 2 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi + set -e + + kill_server + + set +e + if [ "$TEST_VALGRIND" -eq 1 ]; then + python3 ../common/check_valgrind_log.py -f $LEAKCHECK_LOG + if [ $? -ne 0 ]; then + RET=1 + fi + fi + set -e +fi + +### Start Preserve Ordering Tests ### + +# Test only supported on windows currently due to use of python backend models +if [ ${WINDOWS} -ne 1 ]; then + # Test preserve ordering true/false and decoupled/non-decoupled + TEST_CASE=SequenceBatcherPreserveOrderingTest + MODEL_PATH=preserve_ordering_models + BASE_MODEL="../python_models/sequence_py" + rm -rf ${MODEL_PATH} + + # FIXME [DLIS-5280]: This may fail for decoupled models if writes to GRPC + # stream are done out of order in server, so decoupled tests are disabled. + MODES="decoupled nondecoupled" + for mode in $MODES; do + NO_PRESERVE="${MODEL_PATH}/seqpy_no_preserve_ordering_${mode}" + mkdir -p ${NO_PRESERVE}/1 + cp ${BASE_MODEL}/config.pbtxt ${NO_PRESERVE} + cp ${BASE_MODEL}/model.py ${NO_PRESERVE}/1 + + PRESERVE="${MODEL_PATH}/seqpy_preserve_ordering_${mode}" + cp -r ${NO_PRESERVE} ${PRESERVE} + sed -i "s/^preserve_ordering: False/preserve_ordering: True/" ${PRESERVE}/config.pbtxt + + if [ ${mode} == "decoupled" ]; then + echo -e "\nmodel_transaction_policy { decoupled: true }" >> ${NO_PRESERVE}/config.pbtxt + echo -e "\nmodel_transaction_policy { decoupled: true }" >> ${PRESERVE}/config.pbtxt + fi + done + + SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" + SERVER_LOG="./$TEST_CASE.$MODEL_PATH.server.log" + + if [ "$TEST_VALGRIND" -eq 1 ]; then + LEAKCHECK_LOG="./$i.$MODEL_PATH.valgrind.log" + LEAKCHECK_ARGS="$LEAKCHECK_ARGS_BASE --log-file=$LEAKCHECK_LOG" + run_server_leakcheck + else + run_server + fi + + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + echo "Test: $TEST_CASE, repository $MODEL_PATH" >>$CLIENT_LOG + + set +e + python3 $BATCHER_TEST $TEST_CASE >>$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Test $TEST_CASE Failed\n***" >>$CLIENT_LOG + echo -e "\n***\n*** Test $TEST_CASE Failed\n***" + RET=1 + else + # 2 for preserve_ordering = True/False + check_test_results $TEST_RESULT_FILE 2 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi + set -e + + kill_server + + set +e + if [ "$TEST_VALGRIND" -eq 1 ]; then + python3 ../common/check_valgrind_log.py -f $LEAKCHECK_LOG + if [ $? 
-ne 0 ]; then + RET=1 + fi + fi + set -e +fi + +### End Preserve Ordering Tests ### + if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else diff --git a/qa/L0_sequence_corrid_batcher/sequence_corrid_batcher_test.py b/qa/L0_sequence_corrid_batcher/sequence_corrid_batcher_test.py old mode 100644 new mode 100755 index d992b75246..15f16da352 --- a/qa/L0_sequence_corrid_batcher/sequence_corrid_batcher_test.py +++ b/qa/L0_sequence_corrid_batcher/sequence_corrid_batcher_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,27 +27,26 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") import os -import time import threading +import time import unittest + import numpy as np -import test_util as tu import sequence_util as su +import test_util as tu -_test_system_shared_memory = bool( - int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0))) -_test_cuda_shared_memory = bool( - int(os.environ.get('TEST_CUDA_SHARED_MEMORY', 0))) +_test_system_shared_memory = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0))) +_test_cuda_shared_memory = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0))) -_no_batching = (int(os.environ['NO_BATCHING']) == 1) -_model_instances = int(os.environ['MODEL_INSTANCES']) +_no_batching = int(os.environ["NO_BATCHING"]) == 1 +_model_instances = int(os.environ["MODEL_INSTANCES"]) if _no_batching: - _trials = ("savedmodel_nobatch", "graphdef_nobatch", "plan_nobatch", - "onnx_nobatch") + _trials = ("savedmodel_nobatch", "graphdef_nobatch", "plan_nobatch", "onnx_nobatch") else: _trials = ("savedmodel", "graphdef", "plan", "onnx") @@ -54,23 +55,20 @@ class SequenceCorrIDBatcherTest(su.SequenceBatcherTestUtil): - def get_datatype(self, trial): return np.int32 - def get_expected_result(self, - expected_result, - corrid, - value, - trial, - flag_str=None): + def get_expected_result(self, expected_result, corrid, value, trial, flag_str=None): # Adjust the expected_result for models that - # couldn't implement the full accumulator. See + # could not implement the full accumulator. See # qa/common/gen_qa_dyna_sequence_models.py for more # information. 
- if ((("nobatch" not in trial) and ("custom" not in trial)) or \ - ("graphdef" in trial) or ("plan" in trial) or \ - ("onnx" in trial)) or ("libtorch" in trial): + if ( + (("nobatch" not in trial) and ("custom" not in trial)) + or ("graphdef" in trial) + or ("plan" in trial) + or ("onnx" in trial) + ) or ("libtorch" in trial): expected_result = value if flag_str is not None: if "start" in flag_str: @@ -87,14 +85,16 @@ def test_skip_batch(self): for trial in _trials: self.clear_deferred_exceptions() dtype = self.get_datatype(trial) - precreated_shm0_handles = self.precreate_register_regions((1, 3), - dtype, 0) + precreated_shm0_handles = self.precreate_register_regions((1, 3), dtype, 0) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13, 14), dtype, 1) + (11, 12, 13, 14), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 113), dtype, 2) + (111, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113, 1114), dtype, 3) + (1111, 1112, 1113, 1114), dtype, 3 + ) try: model_name = tu.get_dyna_sequence_model_name(trial, dtype) @@ -103,12 +103,11 @@ def test_skip_batch(self): # Need scheduler to wait for queue to contain all # inferences for both sequences. self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", - os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) corrids = [1001, 1002, 1003, 1004] threads = [] @@ -123,12 +122,14 @@ def test_skip_batch(self): (None, None), # (flag_str, value, pre_delay_ms) (("start", 1, None), ("end", 3, None)), - self.get_expected_result(4 + corrids[0], corrids[0], - 3, trial, "end"), - precreated_shm0_handles), - kwargs={ - 'sequence_name': "{}".format(self._testMethodName) - })) + self.get_expected_result( + 4 + corrids[0], corrids[0], 3, trial, "end" + ), + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -139,15 +140,20 @@ def test_skip_batch(self): corrids[1], (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - (None, 13, None), ("end", 14, None)), - self.get_expected_result(50 + corrids[1], - corrids[1], 14, trial, - "end"), - precreated_shm1_handles), - kwargs={ - 'sequence_name': "{}".format(self._testMethodName) - })) + ( + ("start", 11, None), + (None, 12, None), + (None, 13, None), + ("end", 14, None), + ), + self.get_expected_result( + 50 + corrids[1], corrids[1], 14, trial, "end" + ), + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -159,13 +165,14 @@ def test_skip_batch(self): (None, None), # (flag_str, value, pre_delay_ms) (("start", 111, None), ("end", 113, None)), - self.get_expected_result(224 + corrids[2], - corrids[2], 113, trial, - "end"), - precreated_shm2_handles), - kwargs={ - 'sequence_name': "{}".format(self._testMethodName) - })) + self.get_expected_result( + 224 + corrids[2], corrids[2], 113, trial, "end" + ), + precreated_shm2_handles, + ), + kwargs={"sequence_name": 
"{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -176,15 +183,20 @@ def test_skip_batch(self): corrids[3], (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - (None, 1113, None), ("end", 1114, None)), - self.get_expected_result(4450 + corrids[3], - corrids[3], 1114, trial, - "end"), - precreated_shm3_handles), - kwargs={ - 'sequence_name': "{}".format(self._testMethodName) - })) + ( + ("start", 1111, None), + (None, 1112, None), + (None, 1113, None), + ("end", 1114, None), + ), + self.get_expected_result( + 4450 + corrids[3], corrids[3], 1114, trial, "end" + ), + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[1].start() threads[3].start() @@ -210,5 +222,5 @@ def test_skip_batch(self): self.cleanup_shm_regions(precreated_shm3_handles) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_sequence_corrid_batcher/test.sh b/qa/L0_sequence_corrid_batcher/test.sh index 83a8085342..8d114a395a 100755 --- a/qa/L0_sequence_corrid_batcher/test.sh +++ b/qa/L0_sequence_corrid_batcher/test.sh @@ -57,7 +57,7 @@ export CUDA_VISIBLE_DEVICES=0 # Setup non-variable-size model repositories. The same models are in each # repository but they are configured as: # models4 - four instances with batch-size 1 -rm -fr *.log *.serverlog models{0,1,2,4} && mkdir models4 +rm -fr *.log models{0,1,2,4} && mkdir models4 for m in \ $DATADIR/qa_dyna_sequence_model_repository/graphdef_dyna_sequence_int32 \ $DATADIR/qa_dyna_sequence_model_repository/savedmodel_dyna_sequence_int32 \ @@ -88,7 +88,7 @@ for model_trial in 4; do export TRITONSERVER_BACKLOG_DELAY_SCHEDULER=0 export TRITONSERVER_DELAY_SCHEDULER=12 SERVER_ARGS="--model-repository=`pwd`/$MODEL_DIR" - SERVER_LOG="./$i.$MODEL_DIR.serverlog" + SERVER_LOG="./$i.$MODEL_DIR.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" diff --git a/qa/L0_sequence_stress/sequence_stress.py b/qa/L0_sequence_stress/sequence_stress.py old mode 100644 new mode 100755 index 44679e171e..039cf793a2 --- a/qa/L0_sequence_stress/sequence_stress.py +++ b/qa/L0_sequence_stress/sequence_stress.py @@ -1,4 +1,6 @@ -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,17 +27,18 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") import argparse -from builtins import range -from builtins import str -import time import threading +import time import traceback +from builtins import range, str +from functools import partial + import numpy as np import test_util as tu -from functools import partial import tritongrpcclient as grpcclient from tritonclientutils import np_to_triton_dtype @@ -55,7 +58,6 @@ class UserData: - def __init__(self): self._completed_requests = queue.Queue() @@ -70,21 +72,27 @@ class TimeoutException(Exception): pass -def check_sequence_async(client_metadata, - trial, - model_name, - input_dtype, - steps, - timeout_ms=DEFAULT_TIMEOUT_MS, - sequence_name=""): +def check_sequence_async( + client_metadata, + trial, + model_name, + input_dtype, + steps, + timeout_ms=DEFAULT_TIMEOUT_MS, + sequence_name="", +): """Perform sequence of inferences using async run. The 'steps' holds a list of tuples, one for each inference with format: (flag_str, value, expected_result, delay_ms) """ - if (("savedmodel" in trial) or ("graphdef" in trial) or - ("custom" in trial) or ("plan" in trial)): + if ( + ("savedmodel" in trial) + or ("graphdef" in trial) + or ("custom" in trial) + or ("plan" in trial) + ): tensor_shape = ( 1, 1, @@ -107,27 +115,29 @@ def check_sequence_async(client_metadata, seq_start = False seq_end = False if flag_str is not None: - seq_start = ("start" in flag_str) - seq_end = ("end" in flag_str) + seq_start = "start" in flag_str + seq_end = "end" in flag_str if input_dtype == np.object_: in0 = np.full(tensor_shape, value, dtype=np.int32) - in0n = np.array([str(x) for x in in0.reshape(in0.size)], - dtype=object) + in0n = np.array([str(x) for x in in0.reshape(in0.size)], dtype=object) in0 = in0n.reshape(tensor_shape) else: in0 = np.full(tensor_shape, value, dtype=input_dtype) inputs = [ - grpcclient.InferInput("INPUT", tensor_shape, - np_to_triton_dtype(input_dtype)), + grpcclient.InferInput( + "INPUT", tensor_shape, np_to_triton_dtype(input_dtype) + ), ] inputs[0].set_data_from_numpy(in0) - triton_client.async_stream_infer(model_name, - inputs, - sequence_id=sequence_id, - sequence_start=seq_start, - sequence_end=seq_end) + triton_client.async_stream_infer( + model_name, + inputs, + sequence_id=sequence_id, + sequence_start=seq_start, + sequence_end=seq_end, + ) sent_count += 1 if delay_ms is not None: @@ -146,23 +156,21 @@ def check_sequence_async(client_metadata, if timeout_ms != None: now_ms = int(round(time.time() * 1000)) if (now_ms - seq_start_ms) > timeout_ms: - raise TimeoutException( - "Timeout expired for {}".format(sequence_name)) + raise TimeoutException("Timeout expired for {}".format(sequence_name)) result = results.as_numpy("OUTPUT")[0][0] if FLAGS.verbose: - print("{} {}: + {} = {}".format(sequence_name, sequence_id, value, - result)) + print("{} {}: + {} = {}".format(sequence_name, sequence_id, value, result)) if expected is not None: if input_dtype == np.object_: - assert int( - result - ) == expected, "{}: expected result {}, got {}".format( - sequence_name, expected, int(result)) + assert int(result) == expected, "{}: expected result {}, got {}".format( + sequence_name, expected, int(result) + ) else: assert result == expected, "{}: expected result {}, got {}".format( - sequence_name, expected, result) + sequence_name, expected, result + ) triton_client.stop_stream() @@ -175,12 +183,12 @@ def get_datatype(trial): return np.int32 -def sequence_valid(client_metadata, rng, trial, model_name, dtype, len_mean, - len_stddev, sequence_name): +def 
sequence_valid( + client_metadata, rng, trial, model_name, dtype, len_mean, len_stddev, sequence_name +): # Create a variable length sequence with "start" and "end" flags. seqlen = max(1, int(rng.normal(len_mean, len_stddev))) - print("{} {}: valid seqlen = {}".format(sequence_name, client_metadata[1], - seqlen)) + print("{} {}: valid seqlen = {}".format(sequence_name, client_metadata[1], seqlen)) values = rng.randint(0, 1024 * 1024, size=seqlen, dtype=dtype) @@ -199,31 +207,34 @@ def sequence_valid(client_metadata, rng, trial, model_name, dtype, len_mean, expected_result += val # (flag_str, value, expected_result, delay_ms) - steps.append((flags, val, expected_result, delay_ms),) + steps.append( + (flags, val, expected_result, delay_ms), + ) - check_sequence_async(client_metadata, - trial, - model_name, - dtype, - steps, - sequence_name=sequence_name) + check_sequence_async( + client_metadata, trial, model_name, dtype, steps, sequence_name=sequence_name + ) -def sequence_valid_valid(client_metadata, rng, trial, model_name, dtype, - len_mean, len_stddev, sequence_name): +def sequence_valid_valid( + client_metadata, rng, trial, model_name, dtype, len_mean, len_stddev, sequence_name +): # Create two variable length sequences with "start" and "end" # flags, where both sequences use the same correlation ID and are # sent back-to-back. seqlen = [ max(1, int(rng.normal(len_mean, len_stddev))), - max(1, int(rng.normal(len_mean, len_stddev))) + max(1, int(rng.normal(len_mean, len_stddev))), ] - print("{} {}: valid-valid seqlen[0] = {}, seqlen[1] = {}".format( - sequence_name, client_metadata[1], seqlen[0], seqlen[1])) + print( + "{} {}: valid-valid seqlen[0] = {}, seqlen[1] = {}".format( + sequence_name, client_metadata[1], seqlen[0], seqlen[1] + ) + ) values = [ rng.randint(0, 1024 * 1024, size=seqlen[0], dtype=dtype), - rng.randint(0, 1024 * 1024, size=seqlen[1], dtype=dtype) + rng.randint(0, 1024 * 1024, size=seqlen[1], dtype=dtype), ] for p in [0, 1]: @@ -242,31 +253,34 @@ def sequence_valid_valid(client_metadata, rng, trial, model_name, dtype, expected_result += val # (flag_str, value, expected_result, delay_ms) - steps.append((flags, val, expected_result, delay_ms),) + steps.append( + (flags, val, expected_result, delay_ms), + ) - check_sequence_async(client_metadata, - trial, - model_name, - dtype, - steps, - sequence_name=sequence_name) + check_sequence_async( + client_metadata, trial, model_name, dtype, steps, sequence_name=sequence_name + ) -def sequence_valid_no_end(client_metadata, rng, trial, model_name, dtype, - len_mean, len_stddev, sequence_name): +def sequence_valid_no_end( + client_metadata, rng, trial, model_name, dtype, len_mean, len_stddev, sequence_name +): # Create two variable length sequences, the first with "start" and # "end" flags and the second with no "end" flag, where both # sequences use the same correlation ID and are sent back-to-back. 
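# --- Illustrative sketch (not part of the diff above): the "steps" list
# --- consumed by check_sequence_async and built by the sequence_* helpers in
# --- this file. The values below are hypothetical.
# Each entry is (flag_str, value, expected_result, delay_ms); expected_result
# is the running total of the values sent so far in the sequence.
example_steps = [
    ("start", 4, 4, None),  # first request carries the START flag
    (None, 7, 11, None),    # middle request: 4 + 7 = 11
    ("end", 2, 13, None),   # last request carries the END flag: 4 + 7 + 2 = 13
]
# check_sequence_async(client_metadata, trial, model_name, np.int32,
#                      example_steps, sequence_name="illustration")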
seqlen = [ max(1, int(rng.normal(len_mean, len_stddev))), - max(1, int(rng.normal(len_mean, len_stddev))) + max(1, int(rng.normal(len_mean, len_stddev))), ] - print("{} {}: valid-no-end seqlen[0] = {}, seqlen[1] = {}".format( - sequence_name, client_metadata[1], seqlen[0], seqlen[1])) + print( + "{} {}: valid-no-end seqlen[0] = {}, seqlen[1] = {}".format( + sequence_name, client_metadata[1], seqlen[0], seqlen[1] + ) + ) values = [ rng.randint(0, 1024 * 1024, size=seqlen[0], dtype=dtype), - rng.randint(0, 1024 * 1024, size=seqlen[1], dtype=dtype) + rng.randint(0, 1024 * 1024, size=seqlen[1], dtype=dtype), ] for p in [0, 1]: @@ -285,23 +299,22 @@ def sequence_valid_no_end(client_metadata, rng, trial, model_name, dtype, expected_result += val # (flag_str, value, expected_result, delay_ms) - steps.append((flags, val, expected_result, delay_ms),) + steps.append( + (flags, val, expected_result, delay_ms), + ) - check_sequence_async(client_metadata, - trial, - model_name, - dtype, - steps, - sequence_name=sequence_name) + check_sequence_async( + client_metadata, trial, model_name, dtype, steps, sequence_name=sequence_name + ) -def sequence_no_start(client_metadata, rng, trial, model_name, dtype, - sequence_name): +def sequence_no_start(client_metadata, rng, trial, model_name, dtype, sequence_name): # Create a sequence without a "start" flag. Sequence should get an # error from the server. seqlen = 1 - print("{} {}: no-start seqlen = {}".format(sequence_name, - client_metadata[1], seqlen)) + print( + "{} {}: no-start seqlen = {}".format(sequence_name, client_metadata[1], seqlen) + ) values = rng.randint(0, 1024 * 1024, size=seqlen, dtype=dtype) @@ -313,29 +326,33 @@ def sequence_no_start(client_metadata, rng, trial, model_name, dtype, delay_ms = None # (flag_str, value, expected_result, delay_ms) - steps.append((flags, val, None, delay_ms),) + steps.append( + (flags, val, None, delay_ms), + ) try: - check_sequence_async(client_metadata, - trial, - model_name, - dtype, - steps, - sequence_name=sequence_name) + check_sequence_async( + client_metadata, + trial, + model_name, + dtype, + steps, + sequence_name=sequence_name, + ) assert False, "expected inference failure from missing START flag" except Exception as ex: if "must specify the START flag" not in ex.message(): raise -def sequence_no_end(client_metadata, rng, trial, model_name, dtype, len_mean, - len_stddev, sequence_name): +def sequence_no_end( + client_metadata, rng, trial, model_name, dtype, len_mean, len_stddev, sequence_name +): # Create a variable length sequence with "start" flag but that # never ends. The sequence should be aborted by the server and its # slot reused for another sequence. 
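# --- Illustrative sketch (not part of the diff above): a "no-end" steps list
# --- like the one this helper builds; values are hypothetical. No entry ever
# --- carries the "end" flag, so the server's idle timeout eventually aborts
# --- the sequence and frees its slot for reuse.
no_end_steps = [
    ("start", 5, 5, None),
    (None, 9, 14, None),  # 5 + 9 = 14; the sequence is left open
]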
seqlen = max(1, int(rng.normal(len_mean, len_stddev))) - print("{} {}: no-end seqlen = {}".format(sequence_name, client_metadata[1], - seqlen)) + print("{} {}: no-end seqlen = {}".format(sequence_name, client_metadata[1], seqlen)) values = rng.randint(0, 1024 * 1024, size=seqlen, dtype=dtype) @@ -352,18 +369,16 @@ def sequence_no_end(client_metadata, rng, trial, model_name, dtype, len_mean, expected_result += val # (flag_str, value, expected_result, delay_ms) - steps.append((flags, val, expected_result, delay_ms),) + steps.append( + (flags, val, expected_result, delay_ms), + ) - check_sequence_async(client_metadata, - trial, - model_name, - dtype, - steps, - sequence_name=sequence_name) + check_sequence_async( + client_metadata, trial, model_name, dtype, steps, sequence_name=sequence_name + ) -def stress_thread(name, seed, pass_cnt, correlation_id_base, trial, model_name, - dtype): +def stress_thread(name, seed, pass_cnt, correlation_id_base, trial, model_name, dtype): # Thread responsible for generating sequences of inference # requests. global _thread_exceptions @@ -389,9 +404,13 @@ def stress_thread(name, seed, pass_cnt, correlation_id_base, trial, model_name, for c in range(common_cnt + rare_cnt): client_metadata_list.append( - (grpcclient.InferenceServerClient("localhost:8001", - verbose=FLAGS.verbose), - correlation_id_base + c)) + ( + grpcclient.InferenceServerClient( + "localhost:8001", verbose=FLAGS.verbose + ), + correlation_id_base + c, + ) + ) last_choices.append(None) rare_idx = 0 @@ -407,34 +426,40 @@ def stress_thread(name, seed, pass_cnt, correlation_id_base, trial, model_name, # exercise the idle sequence path of the sequence # scheduler if choice < 0.33: - sequence_no_end(client_metadata_list[client_idx], - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_no_end( + client_metadata_list[client_idx], + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "no-end" elif choice < 0.66: - sequence_valid_no_end(client_metadata_list[client_idx], - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_valid_no_end( + client_metadata_list[client_idx], + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "valid-no-end" else: - sequence_valid_valid(client_metadata_list[client_idx], - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_valid_valid( + client_metadata_list[client_idx], + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "valid-valid" rare_idx = (rare_idx + 1) % rare_cnt @@ -450,54 +475,67 @@ def stress_thread(name, seed, pass_cnt, correlation_id_base, trial, model_name, # just assume that the no-start is a continuation of # the no-end sequence instead of being a sequence # missing start flag. 
- if ((last_choice != "no-end") and - (last_choice != "valid-no-end") and (choice < 0.01)): - sequence_no_start(client_metadata, - rng, - trial, - model_name, - dtype, - sequence_name=name) + if ( + (last_choice != "no-end") + and (last_choice != "valid-no-end") + and (choice < 0.01) + ): + sequence_no_start( + client_metadata, + rng, + trial, + model_name, + dtype, + sequence_name=name, + ) last_choices[client_idx] = "no-start" elif choice < 0.05: - sequence_no_end(client_metadata, - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_no_end( + client_metadata, + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "no-end" elif choice < 0.10: - sequence_valid_no_end(client_metadata, - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_valid_no_end( + client_metadata, + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "valid-no-end" elif choice < 0.15: - sequence_valid_valid(client_metadata, - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_valid_valid( + client_metadata, + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "valid-valid" else: - sequence_valid(client_metadata, - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_valid( + client_metadata, + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "valid" except Exception as ex: @@ -518,38 +556,40 @@ def stress_thread(name, seed, pass_cnt, correlation_id_base, trial, model_name, def check_status(model_name): - client = grpcclient.InferenceServerClient("localhost:8001", - verbose=FLAGS.verbose) + client = grpcclient.InferenceServerClient("localhost:8001", verbose=FLAGS.verbose) stats = client.get_inference_statistics(model_name) print(stats) -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-v', - '--verbose', - action="store_true", - required=False, - default=False, - help='Enable verbose output') - parser.add_argument('-r', - '--random-seed', - type=int, - required=False, - help='Random seed.') - parser.add_argument('-t', - '--concurrency', - type=int, - required=False, - default=8, - help='Request concurrency. Default is 8.') parser.add_argument( - '-i', - '--iterations', + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-r", "--random-seed", type=int, required=False, help="Random seed." + ) + parser.add_argument( + "-t", + "--concurrency", + type=int, + required=False, + default=8, + help="Request concurrency. Default is 8.", + ) + parser.add_argument( + "-i", + "--iterations", type=int, required=False, default=200, - help='Number of iterations of stress test to run. Default is 200.') + help="Number of iterations of stress test to run. Default is 200.", + ) FLAGS = parser.parse_args() # Initialize the random seed. 
For reproducibility each thread @@ -583,10 +623,19 @@ def check_status(model_name): correlation_id_base = 1 + (idx * CORRELATION_ID_BLOCK_SIZE) threads.append( - threading.Thread(target=stress_thread, - args=(thread_name, seed, FLAGS.iterations, - correlation_id_base, trial, model_name, - dtype))) + threading.Thread( + target=stress_thread, + args=( + thread_name, + seed, + FLAGS.iterations, + correlation_id_base, + trial, + model_name, + dtype, + ), + ) + ) for t in threads: t.start() diff --git a/qa/L0_sequence_stress/test.sh b/qa/L0_sequence_stress/test.sh index 3961107dfe..b2bc66f8ac 100755 --- a/qa/L0_sequence_stress/test.sh +++ b/qa/L0_sequence_stress/test.sh @@ -39,7 +39,7 @@ RET=0 # models1 - one instance with batch-size 4 # models2 - two instances with batch-size 2 # models4 - four instances with batch-size 1 -rm -fr *.log *.serverlog models{1,2,4} && mkdir models{1,2,4} +rm -fr *.log models{1,2,4} && mkdir models{1,2,4} for m in ../custom_models/custom_sequence_int32 ; do cp -r $m models1/. && \ (cd models1/$(basename $m) && \ @@ -65,7 +65,7 @@ done for model_trial in 1 2 4 ; do MODEL_DIR=models${model_trial} SERVER_ARGS="--model-repository=`pwd`/$MODEL_DIR" - SERVER_LOG="./$MODEL_DIR.serverlog" + SERVER_LOG="./$MODEL_DIR.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" diff --git a/qa/L0_server_status/server_status_test.py b/qa/L0_server_status/server_status_test.py old mode 100644 new mode 100755 index ee6db2a575..7ab04708f0 --- a/qa/L0_server_status/server_status_test.py +++ b/qa/L0_server_status/server_status_test.py @@ -1,4 +1,6 @@ -# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,12 +27,14 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
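The server_status tests that follow all build both an HTTP and a gRPC client and compare the dict-style responses of the former with the attribute-style protobuf responses of the latter. A minimal standalone sketch of that access pattern, assuming a server on the default local ports:

import tritonhttpclient as httpclient
import tritongrpcclient as grpcclient

# HTTP responses are plain Python dicts...
http_client = httpclient.InferenceServerClient(url="localhost:8000", verbose=True)
md = http_client.get_server_metadata()
print(md["name"], md["version"], md["extensions"])

# ...while gRPC responses are protobuf messages accessed as attributes.
grpc_client = grpcclient.InferenceServerClient(url="localhost:8001", verbose=True)
md = grpc_client.get_server_metadata()
print(md.name, md.version, md.extensions)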
import sys + sys.path.append("../common") -import numpy as np import os import unittest + import infer_util as iu +import numpy as np import test_util as tu import tritongrpcclient as grpcclient import tritonhttpclient as httpclient @@ -38,24 +42,29 @@ class ServerMetadataTest(tu.TestResultCollector): - def test_basic(self): try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: model_name = "graphdef_int32_int8_int8" extensions = [ - 'classification', 'sequence', 'model_repository', - 'schedule_policy', 'model_configuration', - 'system_shared_memory', 'cuda_shared_memory', - 'binary_tensor_data', 'statistics' + "classification", + "sequence", + "model_repository", + "schedule_policy", + "model_configuration", + "system_shared_memory", + "cuda_shared_memory", + "binary_tensor_data", + "statistics", ] if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) @@ -63,16 +72,18 @@ def test_basic(self): model_metadata = triton_client.get_model_metadata(model_name) if pair[1] == "http": - self.assertEqual(os.environ["TRITON_SERVER_VERSION"], - server_metadata['version']) - self.assertEqual("triton", server_metadata['name']) + self.assertEqual( + os.environ["TRITON_SERVER_VERSION"], server_metadata["version"] + ) + self.assertEqual("triton", server_metadata["name"]) for ext in extensions: - self.assertIn(ext, server_metadata['extensions']) + self.assertIn(ext, server_metadata["extensions"]) - self.assertEqual(model_name, model_metadata['name']) + self.assertEqual(model_name, model_metadata["name"]) else: - self.assertEqual(os.environ["TRITON_SERVER_VERSION"], - server_metadata.version) + self.assertEqual( + os.environ["TRITON_SERVER_VERSION"], server_metadata.version + ) self.assertEqual("triton", server_metadata.name) for ext in extensions: self.assertIn(ext, server_metadata.extensions) @@ -83,91 +94,96 @@ def test_basic(self): def test_unknown_model(self): try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: model_name = "foo" if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) server_metadata = triton_client.get_server_metadata() if pair[1] == "http": - self.assertEqual(os.environ["TRITON_SERVER_VERSION"], - server_metadata['version']) - self.assertEqual("triton", server_metadata['name']) + self.assertEqual( + os.environ["TRITON_SERVER_VERSION"], server_metadata["version"] + ) + self.assertEqual("triton", server_metadata["name"]) else: - self.assertEqual(os.environ["TRITON_SERVER_VERSION"], - server_metadata.version) + self.assertEqual( + os.environ["TRITON_SERVER_VERSION"], server_metadata.version + ) self.assertEqual("triton", server_metadata.name) model_metadata = triton_client.get_model_metadata(model_name) self.assertTrue(False, "expected unknown model failure") except InferenceServerException as ex: - 
self.assertTrue(ex.message().startswith( - "Request for unknown model: 'foo' is not found")) + self.assertTrue( + ex.message().startswith("Request for unknown model: 'foo' is not found") + ) def test_unknown_model_version(self): try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: model_name = "graphdef_int32_int8_int8" if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) model_metadata = triton_client.get_model_metadata( - model_name, model_version="99") + model_name, model_version="99" + ) self.assertTrue(False, "expected unknown model version failure") except InferenceServerException as ex: - self.assertTrue(ex.message().startswith( - "Request for unknown model: 'graphdef_int32_int8_int8' version 99 is not found" - )) + self.assertTrue( + ex.message().startswith( + "Request for unknown model: 'graphdef_int32_int8_int8' version 99 is not found" + ) + ) def test_model_latest_infer(self): input_size = 16 tensor_shape = (1, input_size) - platform_name = { - 'graphdef': 'tensorflow_graphdef', - 'onnx': 'onnxruntime_onnx' - } + platform_name = {"graphdef": "tensorflow_graphdef", "onnx": "onnxruntime_onnx"} # There are 3 versions of *_int32_int32_int32 and all # should be available. - for platform in ('graphdef', 'onnx'): + for platform in ("graphdef", "onnx"): model_name = platform + "_int32_int32_int32" # Initially there should be no version stats.. try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) - model_metadata = triton_client.get_model_metadata( - model_name) + model_metadata = triton_client.get_model_metadata(model_name) # verify all versions are reported when no model version is specified if pair[1] == "http": - self.assertEqual(model_name, model_metadata['name']) - self.assertEqual(len(model_metadata['versions']), 3) + self.assertEqual(model_name, model_metadata["name"]) + self.assertEqual(len(model_metadata["versions"]), 3) for v in (1, 2, 3): - self.assertIn(str(v), model_metadata['versions']) + self.assertIn(str(v), model_metadata["versions"]) else: self.assertEqual(model_name, model_metadata.name) self.assertEqual(len(model_metadata.versions), 3) @@ -176,9 +192,9 @@ def test_model_latest_infer(self): # verify contents of model metadata if pair[1] == "http": - model_platform = model_metadata['platform'] - model_inputs = model_metadata['inputs'] - model_outputs = model_metadata['outputs'] + model_platform = model_metadata["platform"] + model_inputs = model_metadata["inputs"] + model_outputs = model_metadata["outputs"] else: model_platform = model_metadata.platform model_inputs = model_metadata.inputs @@ -190,9 +206,9 @@ def test_model_latest_infer(self): for model_input in model_inputs: if pair[1] == "http": - input_dtype = model_input['datatype'] - 
input_shape = model_input['shape'] - input_name = model_input['name'] + input_dtype = model_input["datatype"] + input_shape = model_input["shape"] + input_name = model_input["name"] else: input_dtype = model_input.datatype input_shape = model_input.shape @@ -203,9 +219,9 @@ def test_model_latest_infer(self): for model_output in model_outputs: if pair[1] == "http": - output_dtype = model_output['datatype'] - output_shape = model_output['shape'] - output_name = model_output['name'] + output_dtype = model_output["datatype"] + output_shape = model_output["shape"] + output_name = model_output["name"] else: output_dtype = model_output.datatype output_shape = model_output.shape @@ -218,67 +234,79 @@ def test_model_latest_infer(self): self.assertTrue(False, "unexpected error {}".format(ex)) # Infer using latest version (which is 3)... - iu.infer_exact(self, - platform, - tensor_shape, - 1, - np.int32, - np.int32, - np.int32, - model_version=None, - swap=True) + iu.infer_exact( + self, + platform, + tensor_shape, + 1, + np.int32, + np.int32, + np.int32, + model_version=None, + swap=True, + ) try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) for v in (1, 2, 3): self.assertTrue( - triton_client.is_model_ready(model_name, - model_version=str(v))) + triton_client.is_model_ready( + model_name, model_version=str(v) + ) + ) # Only version 3 should have infer stats - infer_stats = triton_client.get_inference_statistics( - model_name) + infer_stats = triton_client.get_inference_statistics(model_name) if pair[1] == "http": - stats = infer_stats['model_stats'] + stats = infer_stats["model_stats"] else: stats = infer_stats.model_stats self.assertEqual( - len(stats), 3, - "expected 3 infer stats for model " + model_name) + len(stats), 3, "expected 3 infer stats for model " + model_name + ) for s in stats: if pair[1] == "http": - v = s['version'] - stat = s['inference_stats'] + v = s["version"] + stat = s["inference_stats"] else: v = s.version stat = s.inference_stats if v == "3": if pair[1] == "http": - self.assertTrue(stat['success']['count'], 3) + self.assertTrue(stat["success"]["count"], 3) else: self.assertTrue(stat.success.count, 3) else: if pair[1] == "http": self.assertEqual( - stat['success']['count'], 0, + stat["success"]["count"], + 0, "unexpected infer success counts for version " - + str(v) + " of model " + model_name) + + str(v) + + " of model " + + model_name, + ) else: self.assertEqual( - stat.success.count, 0, + stat.success.count, + 0, "unexpected infer success counts for version " - + str(v) + " of model " + model_name) + + str(v) + + " of model " + + model_name, + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -288,136 +316,150 @@ def test_model_specific_infer(self): # There are 3 versions of *_float32_float32_float32 but only # versions 1 and 3 should be available. - for platform in ('graphdef', 'onnx', 'plan'): + for platform in ("graphdef", "onnx", "plan"): tensor_shape = (1, input_size) model_name = platform + "_float32_float32_float32" # Initially there should be no version status... 
try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) self.assertTrue( - triton_client.is_model_ready(model_name, - model_version="1")) + triton_client.is_model_ready(model_name, model_version="1") + ) self.assertFalse( - triton_client.is_model_ready(model_name, - model_version="2")) + triton_client.is_model_ready(model_name, model_version="2") + ) self.assertTrue( - triton_client.is_model_ready(model_name, - model_version="3")) + triton_client.is_model_ready(model_name, model_version="3") + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) # Infer using version 1... - iu.infer_exact(self, - platform, - tensor_shape, - 1, - np.float32, - np.float32, - np.float32, - model_version=1, - swap=False) + iu.infer_exact( + self, + platform, + tensor_shape, + 1, + np.float32, + np.float32, + np.float32, + model_version=1, + swap=False, + ) try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) self.assertTrue( - triton_client.is_model_ready(model_name, - model_version="1")) + triton_client.is_model_ready(model_name, model_version="1") + ) self.assertFalse( - triton_client.is_model_ready(model_name, - model_version="2")) + triton_client.is_model_ready(model_name, model_version="2") + ) self.assertTrue( - triton_client.is_model_ready(model_name, - model_version="3")) + triton_client.is_model_ready(model_name, model_version="3") + ) # Only version 1 should have infer stats infer_stats = triton_client.get_inference_statistics( - model_name, model_version='1') + model_name, model_version="1" + ) if pair[1] == "http": self.assertEqual( - len(infer_stats['model_stats']), 1, + len(infer_stats["model_stats"]), + 1, "expected 1 infer stats for version 1" - " of model " + model_name) - stats = infer_stats['model_stats'][0]['inference_stats'] - self.assertTrue(stats['success']['count'], 3) + " of model " + model_name, + ) + stats = infer_stats["model_stats"][0]["inference_stats"] + self.assertTrue(stats["success"]["count"], 3) else: self.assertEqual( - len(infer_stats.model_stats), 1, + len(infer_stats.model_stats), + 1, "expected 1 infer stats for version 1" - " of model " + model_name) + " of model " + model_name, + ) stats = infer_stats.model_stats[0].inference_stats self.assertTrue(stats.success.count, 3) infer_stats = triton_client.get_inference_statistics( - model_name, model_version='3') + model_name, model_version="3" + ) if pair[1] == "http": - stats = infer_stats['model_stats'][0]['inference_stats'] + stats = infer_stats["model_stats"][0]["inference_stats"] self.assertEqual( - stats['success']['count'], 0, + stats["success"]["count"], + 0, "unexpected infer stats for version 3" - " 
of model " + model_name) + " of model " + model_name, + ) else: stats = infer_stats.model_stats[0].inference_stats self.assertEqual( - stats.success.count, 0, + stats.success.count, + 0, "unexpected infer stats for version 3" - " of model " + model_name) + " of model " + model_name, + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) class ModelMetadataTest(tu.TestResultCollector): - ''' + """ These tests must be run after the ServerMetadataTest. See test.sh file for correct test running. - ''' + """ def test_model_versions_deleted(self): # Originally There were 3 versions of *_int32_int32_int32 and # version 3 was executed once. Version 2 and 3 models were # deleted from the model repository so now only expect version 1 to # be ready and show stats. - for platform in ('graphdef', 'onnx'): + for platform in ("graphdef", "onnx"): model_name = platform + "_int32_int32_int32" try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) - model_metadata = triton_client.get_model_metadata( - model_name) + model_metadata = triton_client.get_model_metadata(model_name) if pair[1] == "http": - self.assertEqual(model_name, model_metadata['name']) - self.assertEqual(len(model_metadata['versions']), 1) - self.assertEqual("1", model_metadata['versions'][0]) + self.assertEqual(model_name, model_metadata["name"]) + self.assertEqual(len(model_metadata["versions"]), 1) + self.assertEqual("1", model_metadata["versions"][0]) else: self.assertEqual(model_name, model_metadata.name) self.assertEqual(len(model_metadata.versions), 1) @@ -428,30 +470,41 @@ def test_model_versions_deleted(self): if v == 1: self.assertTrue( triton_client.is_model_ready( - model_name, model_version=str(v))) + model_name, model_version=str(v) + ) + ) infer_stats = triton_client.get_inference_statistics( - model_name, model_version=str(v)) + model_name, model_version=str(v) + ) if pair[1] == "http": self.assertEqual( - len(infer_stats['model_stats']), 1, - "expected 1 infer stats for version " + - str(v) + " of model " + model_name) - stats = infer_stats['model_stats'][0][ - 'inference_stats'] - self.assertEqual(stats['success']['count'], 0) + len(infer_stats["model_stats"]), + 1, + "expected 1 infer stats for version " + + str(v) + + " of model " + + model_name, + ) + stats = infer_stats["model_stats"][0]["inference_stats"] + self.assertEqual(stats["success"]["count"], 0) else: self.assertEqual( - len(infer_stats.model_stats), 1, - "expected 1 infer stats for version " + - str(v) + " of model " + model_name) - stats = infer_stats.model_stats[ - 0].inference_stats + len(infer_stats.model_stats), + 1, + "expected 1 infer stats for version " + + str(v) + + " of model " + + model_name, + ) + stats = infer_stats.model_stats[0].inference_stats self.assertEqual(stats.success.count, 0) else: self.assertFalse( triton_client.is_model_ready( - model_name, model_version=str(v))) + model_name, model_version=str(v) + ) + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -460,40 +513,46 @@ def 
test_model_versions_added(self): # Originally There was version 1 of *_float16_float32_float32. # Version 7 was added so now expect just version 7 to be ready # and provide infer stats. - for platform in ('graphdef',): + for platform in ("graphdef",): model_name = platform + "_float16_float32_float32" try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) - model_metadata = triton_client.get_model_metadata( - model_name) + model_metadata = triton_client.get_model_metadata(model_name) if pair[1] == "http": self.assertEqual( - model_name, model_metadata['name'], - "expected status for model " + model_name) + model_name, + model_metadata["name"], + "expected status for model " + model_name, + ) self.assertEqual( - len(model_metadata['versions']), 1, - "expected status for 1 versions for model " + - model_name) - self.assertEqual("7", model_metadata['versions'][0]) + len(model_metadata["versions"]), + 1, + "expected status for 1 versions for model " + model_name, + ) + self.assertEqual("7", model_metadata["versions"][0]) else: self.assertEqual( - model_name, model_metadata.name, - "expected status for model " + model_name) + model_name, + model_metadata.name, + "expected status for model " + model_name, + ) self.assertEqual( - len(model_metadata.versions), 1, - "expected status for 1 versions for model " + - model_name) + len(model_metadata.versions), + 1, + "expected status for 1 versions for model " + model_name, + ) self.assertEqual("7", model_metadata.versions[0]) # Only version 7 should be ready and show infer stat. 
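The next hunk walks over the version numbers and cross-checks is_model_ready() against per-version inference statistics. Condensed into a standalone sketch (HTTP client shown, dict-style access; the gRPC client exposes the same fields as message attributes):

import tritonhttpclient as httpclient

# Sketch of the per-version readiness/statistics checks exercised below.
model_name = "graphdef_float16_float32_float32"
triton_client = httpclient.InferenceServerClient(url="localhost:8000", verbose=True)

for v in range(1, 8):
    if triton_client.is_model_ready(model_name, model_version=str(v)):
        stats = triton_client.get_inference_statistics(model_name, model_version=str(v))
        count = stats["model_stats"][0]["inference_stats"]["success"]["count"]
        print("version", v, "success count =", count)
    else:
        # Asking for statistics of an unloaded version raises InferenceServerException
        # ("requested model version is not available for model ...").
        print("version", v, "is not ready")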
@@ -501,39 +560,52 @@ def test_model_versions_added(self): if v == 7: self.assertTrue( triton_client.is_model_ready( - model_name, model_version=str(v))) + model_name, model_version=str(v) + ) + ) infer_stats = triton_client.get_inference_statistics( - model_name, model_version=str(v)) + model_name, model_version=str(v) + ) if pair[1] == "http": - stats = infer_stats['model_stats'][0][ - 'inference_stats'] + stats = infer_stats["model_stats"][0]["inference_stats"] self.assertEqual( - stats['success']['count'], 0, - "unexpected infer stats for version " + - str(v) + " of model " + model_name) + stats["success"]["count"], + 0, + "unexpected infer stats for version " + + str(v) + + " of model " + + model_name, + ) else: - stats = infer_stats.model_stats[ - 0].inference_stats + stats = infer_stats.model_stats[0].inference_stats self.assertEqual( - stats.success.count, 0, - "unexpected infer stats for version " + - str(v) + " of model " + model_name) + stats.success.count, + 0, + "unexpected infer stats for version " + + str(v) + + " of model " + + model_name, + ) else: self.assertFalse( triton_client.is_model_ready( - model_name, model_version=str(v))) + model_name, model_version=str(v) + ) + ) try: infer_stats = triton_client.get_inference_statistics( - model_name, model_version=str(v)) + model_name, model_version=str(v) + ) self.assertTrue( False, - "unexpected infer stats for the model that is not ready" + "unexpected infer stats for the model that is not ready", ) except InferenceServerException as ex: self.assertIn( "requested model version is not available for model", - str(ex)) + str(ex), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -543,27 +615,27 @@ def test_infer_stats_no_model_version(self): # version 3 was executed once. Version 2 and 3 models were # deleted from the model repository so now only expect version 1 to # be ready and show infer stats. 
- for platform in ('graphdef', 'onnx'): + for platform in ("graphdef", "onnx"): model_name = platform + "_int32_int32_int32" try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) - model_metadata = triton_client.get_model_metadata( - model_name) + model_metadata = triton_client.get_model_metadata(model_name) if pair[1] == "http": - self.assertEqual(model_name, model_metadata['name']) - self.assertEqual(len(model_metadata['versions']), 1) - self.assertEqual("1", model_metadata['versions'][0]) + self.assertEqual(model_name, model_metadata["name"]) + self.assertEqual(len(model_metadata["versions"]), 1) + self.assertEqual("1", model_metadata["versions"][0]) else: self.assertEqual(model_name, model_metadata.name) self.assertEqual(len(model_metadata.versions), 1) @@ -574,44 +646,55 @@ def test_infer_stats_no_model_version(self): if v == 1: self.assertTrue( triton_client.is_model_ready( - model_name, model_version=str(v))) + model_name, model_version=str(v) + ) + ) else: self.assertFalse( triton_client.is_model_ready( - model_name, model_version=str(v))) + model_name, model_version=str(v) + ) + ) - infer_stats = triton_client.get_inference_statistics( - model_name) + infer_stats = triton_client.get_inference_statistics(model_name) if pair[1] == "http": - stats = infer_stats['model_stats'] + stats = infer_stats["model_stats"] else: stats = infer_stats.model_stats self.assertEqual( - len(stats), 1, - "expected 1 infer stats for model " + model_name) + len(stats), 1, "expected 1 infer stats for model " + model_name + ) if pair[1] == "http": - version = stats[0]['version'] - stat = stats[0]['inference_stats'] + version = stats[0]["version"] + stat = stats[0]["inference_stats"] else: version = stats[0].version stat = stats[0].inference_stats if version != "1": self.assertTrue( - False, - "expected version 1 for infer stat, got " + version) + False, "expected version 1 for infer stat, got " + version + ) else: if pair[1] == "http": self.assertEqual( - stat['success']['count'], 0, - "unexpected infer stats for version " + - str(version) + " of model " + model_name) + stat["success"]["count"], + 0, + "unexpected infer stats for version " + + str(version) + + " of model " + + model_name, + ) else: self.assertEqual( - stat.success.count, 0, - "unexpected infer stats for version " + - str(version) + " of model " + model_name) + stat.success.count, + 0, + "unexpected infer stats for version " + + str(version) + + " of model " + + model_name, + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -619,14 +702,15 @@ def test_infer_stats_no_model_version(self): def test_infer_stats_no_model(self): # Test get_inference_statistics when no model/model_version is passed. 
try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) @@ -634,17 +718,18 @@ def test_infer_stats_no_model(self): # Returns infer stats for ALL models + ready versions infer_stats = triton_client.get_inference_statistics() if pair[1] == "http": - stats = infer_stats['model_stats'] + stats = infer_stats["model_stats"] else: stats = infer_stats.model_stats self.assertEqual( - len(stats), 207, - "expected 207 infer stats for all ready versions of all model" + len(stats), + 219, + "expected 219 infer stats for all ready versions of all model", ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_shared_memory/shared_memory_test.py b/qa/L0_shared_memory/shared_memory_test.py old mode 100644 new mode 100755 index 867f6a85b8..6350dc2abe --- a/qa/L0_shared_memory/shared_memory_test.py +++ b/qa/L0_shared_memory/shared_memory_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,11 +27,13 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
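The shared memory tests that follow all revolve around the same system shared memory lifecycle: create a region, copy data into it, register it with the server, point an input (or output) at it, then unregister and destroy it. A condensed sketch of that lifecycle, assuming the shm alias used in this file refers to tritonclient.utils.shared_memory:

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm  # assumed module behind the "shm" alias

client = httpclient.InferenceServerClient("localhost:8000")

# 1. Create a 64-byte system shared memory region and copy input data into it.
handle = shm.create_shared_memory_region("input0_data", "/input0_data", 64)
shm.set_shared_memory_region(handle, [np.arange(16, dtype=np.int32)])

# 2. Register the region with the server and reference it from an input.
client.register_system_shared_memory("input0_data", "/input0_data", 64)
inp = httpclient.InferInput("INPUT0", [1, 16], "INT32")
inp.set_shared_memory("input0_data", 64)

# 3. (Run the inference here.) Afterwards, tear everything down again.
client.unregister_system_shared_memory("input0_data")
shm.destroy_shared_memory_region(handle)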
import sys + sys.path.append("../common") -import numpy as np -import unittest import os +import unittest + +import numpy as np import test_util as tu import tritonclient.grpc as grpcclient import tritonclient.http as httpclient @@ -38,12 +42,12 @@ class SharedMemoryTest(tu.TestResultCollector): - def test_invalid_create_shm(self): # Raises error since tried to create invalid system shared memory region try: shm_op0_handle = shm.create_shared_memory_region( - "dummy_data", "/dummy_data", -1) + "dummy_data", "/dummy_data", -1 + ) shm.destroy_shared_memory_region(shm_op0_handle) except Exception as ex: self.assertTrue(str(ex) == "unable to initialize the size") @@ -54,12 +58,11 @@ def test_valid_create_set_register(self): triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - shm_op0_handle = shm.create_shared_memory_region( - "dummy_data", "/dummy_data", 8) - shm.set_shared_memory_region(shm_op0_handle, - [np.array([1, 2], dtype=np.float32)]) - triton_client.register_system_shared_memory("dummy_data", "/dummy_data", - 8) + shm_op0_handle = shm.create_shared_memory_region("dummy_data", "/dummy_data", 8) + shm.set_shared_memory_region( + shm_op0_handle, [np.array([1, 2], dtype=np.float32)] + ) + triton_client.register_system_shared_memory("dummy_data", "/dummy_data", 8) shm_status = triton_client.get_system_shared_memory_status() if _protocol == "http": self.assertTrue(len(shm_status) == 1) @@ -73,8 +76,7 @@ def test_unregister_before_register(self): triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - shm_op0_handle = shm.create_shared_memory_region( - "dummy_data", "/dummy_data", 8) + shm_op0_handle = shm.create_shared_memory_region("dummy_data", "/dummy_data", 8) triton_client.unregister_system_shared_memory("dummy_data") shm_status = triton_client.get_system_shared_memory_status() if _protocol == "http": @@ -89,10 +91,8 @@ def test_unregister_after_register(self): triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - shm_op0_handle = shm.create_shared_memory_region( - "dummy_data", "/dummy_data", 8) - triton_client.register_system_shared_memory("dummy_data", "/dummy_data", - 8) + shm_op0_handle = shm.create_shared_memory_region("dummy_data", "/dummy_data", 8) + triton_client.register_system_shared_memory("dummy_data", "/dummy_data", 8) triton_client.unregister_system_shared_memory("dummy_data") shm_status = triton_client.get_system_shared_memory_status() if _protocol == "http": @@ -107,17 +107,14 @@ def test_reregister_after_register(self): triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - shm_op0_handle = shm.create_shared_memory_region( - "dummy_data", "/dummy_data", 8) - triton_client.register_system_shared_memory("dummy_data", "/dummy_data", - 8) + shm_op0_handle = shm.create_shared_memory_region("dummy_data", "/dummy_data", 8) + triton_client.register_system_shared_memory("dummy_data", "/dummy_data", 8) try: - triton_client.register_system_shared_memory("dummy_data", - "/dummy_data", 8) + triton_client.register_system_shared_memory("dummy_data", "/dummy_data", 8) except Exception as ex: self.assertTrue( - "shared memory region 'dummy_data' already in manager" in str( - ex)) + "shared memory region 'dummy_data' already 
in manager" in str(ex) + ) shm_status = triton_client.get_system_shared_memory_status() if _protocol == "http": self.assertTrue(len(shm_status) == 1) @@ -127,13 +124,17 @@ def test_reregister_after_register(self): def _configure_sever(self): shm_ip0_handle = shm.create_shared_memory_region( - "input0_data", "/input0_data", 64) + "input0_data", "/input0_data", 64 + ) shm_ip1_handle = shm.create_shared_memory_region( - "input1_data", "/input1_data", 64) + "input1_data", "/input1_data", 64 + ) shm_op0_handle = shm.create_shared_memory_region( - "output0_data", "/output0_data", 64) + "output0_data", "/output0_data", 64 + ) shm_op1_handle = shm.create_shared_memory_region( - "output1_data", "/output1_data", 64) + "output1_data", "/output1_data", 64 + ) input0_data = np.arange(start=0, stop=16, dtype=np.int32) input1_data = np.ones(shape=16, dtype=np.int32) shm.set_shared_memory_region(shm_ip0_handle, [input0_data]) @@ -142,28 +143,26 @@ def _configure_sever(self): triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - triton_client.register_system_shared_memory("input0_data", - "/input0_data", 64) - triton_client.register_system_shared_memory("input1_data", - "/input1_data", 64) - triton_client.register_system_shared_memory("output0_data", - "/output0_data", 64) - triton_client.register_system_shared_memory("output1_data", - "/output1_data", 64) + triton_client.register_system_shared_memory("input0_data", "/input0_data", 64) + triton_client.register_system_shared_memory("input1_data", "/input1_data", 64) + triton_client.register_system_shared_memory("output0_data", "/output0_data", 64) + triton_client.register_system_shared_memory("output1_data", "/output1_data", 64) return [shm_ip0_handle, shm_ip1_handle, shm_op0_handle, shm_op1_handle] def _cleanup_server(self, shm_handles): for shm_handle in shm_handles: shm.destroy_shared_memory_region(shm_handle) - def _basic_inference(self, - shm_ip0_handle, - shm_ip1_handle, - shm_op0_handle, - shm_op1_handle, - error_msg, - big_shm_name="", - big_shm_size=64): + def _basic_inference( + self, + shm_ip0_handle, + shm_ip1_handle, + shm_op0_handle, + shm_op1_handle, + error_msg, + big_shm_name="", + big_shm_size=64, + ): input0_data = np.arange(start=0, stop=16, dtype=np.int32) input1_data = np.ones(shape=16, dtype=np.int32) inputs = [] @@ -172,16 +171,16 @@ def _basic_inference(self, triton_client = httpclient.InferenceServerClient(_url, verbose=True) inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)) outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + httpclient.InferRequestedOutput("OUTPUT1", binary_data=False) + ) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) inputs.append(grpcclient.InferInput("INPUT0", [1, 16], "INT32")) inputs.append(grpcclient.InferInput("INPUT1", [1, 16], "INT32")) - outputs.append(grpcclient.InferRequestedOutput('OUTPUT0')) - outputs.append(grpcclient.InferRequestedOutput('OUTPUT1')) + outputs.append(grpcclient.InferRequestedOutput("OUTPUT0")) + outputs.append(grpcclient.InferRequestedOutput("OUTPUT1")) inputs[0].set_shared_memory("input0_data", 64) @@ -196,23 +195,24 @@ def _basic_inference(self, outputs[1].set_shared_memory("output1_data", 64) 
try: - results = triton_client.infer("simple", - inputs, - model_version="", - outputs=outputs) - output = results.get_output('OUTPUT0') + results = triton_client.infer( + "simple", inputs, model_version="", outputs=outputs + ) + output = results.get_output("OUTPUT0") if _protocol == "http": - output_datatype = output['datatype'] - output_shape = output['shape'] + output_datatype = output["datatype"] + output_shape = output["shape"] else: output_datatype = output.datatype output_shape = output.shape output_dtype = utils.triton_to_np_dtype(output_datatype) - output_data = shm.get_contents_as_numpy(shm_op0_handle, - output_dtype, output_shape) + output_data = shm.get_contents_as_numpy( + shm_op0_handle, output_dtype, output_shape + ) self.assertTrue( (output_data[0] == (input0_data + input1_data)).all(), - "Model output does not match expected output") + "Model output does not match expected output", + ) except Exception as ex: error_msg.append(str(ex)) @@ -220,8 +220,9 @@ def test_unregister_after_inference(self): # Unregister after inference error_msg = [] shm_handles = self._configure_sever() - self._basic_inference(shm_handles[0], shm_handles[1], shm_handles[2], - shm_handles[3], error_msg) + self._basic_inference( + shm_handles[0], shm_handles[1], shm_handles[2], shm_handles[3], error_msg + ) if len(error_msg) > 0: raise Exception(str(error_msg)) if _protocol == "http": @@ -244,14 +245,15 @@ def test_register_after_inference(self): triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - self._basic_inference(shm_handles[0], shm_handles[1], shm_handles[2], - shm_handles[3], error_msg) + self._basic_inference( + shm_handles[0], shm_handles[1], shm_handles[2], shm_handles[3], error_msg + ) if len(error_msg) > 0: raise Exception(str(error_msg)) shm_ip2_handle = shm.create_shared_memory_region( - "input2_data", "/input2_data", 64) - triton_client.register_system_shared_memory("input2_data", - "/input2_data", 64) + "input2_data", "/input2_data", 64 + ) + triton_client.register_system_shared_memory("input2_data", "/input2_data", 64) shm_status = triton_client.get_system_shared_memory_status() if _protocol == "http": self.assertTrue(len(shm_status) == 5) @@ -265,19 +267,27 @@ def test_too_big_shm(self): error_msg = [] shm_handles = self._configure_sever() shm_ip2_handle = shm.create_shared_memory_region( - "input2_data", "/input2_data", 128) + "input2_data", "/input2_data", 128 + ) if _protocol == "http": triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - triton_client.register_system_shared_memory("input2_data", - "/input2_data", 128) - self._basic_inference(shm_handles[0], shm_ip2_handle, shm_handles[2], - shm_handles[3], error_msg, "input2_data", 128) + triton_client.register_system_shared_memory("input2_data", "/input2_data", 128) + self._basic_inference( + shm_handles[0], + shm_ip2_handle, + shm_handles[2], + shm_handles[3], + error_msg, + "input2_data", + 128, + ) if len(error_msg) > 0: self.assertTrue( "unexpected total byte size 128 for input 'INPUT1', expecting 64" - in error_msg[-1]) + in error_msg[-1] + ) shm_handles.append(shm_ip2_handle) self._cleanup_server(shm_handles) @@ -286,8 +296,9 @@ def test_mixed_raw_shm(self): error_msg = [] shm_handles = self._configure_sever() input1_data = np.ones(shape=16, dtype=np.int32) - self._basic_inference(shm_handles[0], [input1_data], shm_handles[2], - 
shm_handles[3], error_msg) + self._basic_inference( + shm_handles[0], [input1_data], shm_handles[2], shm_handles[3], error_msg + ) if len(error_msg) > 0: raise Exception(error_msg[-1]) self._cleanup_server(shm_handles) @@ -313,8 +324,8 @@ def test_unregisterall(self): self._cleanup_server(shm_handles) -if __name__ == '__main__': - _protocol = os.environ.get('CLIENT_TYPE', "http") +if __name__ == "__main__": + _protocol = os.environ.get("CLIENT_TYPE", "http") if _protocol == "http": _url = "localhost:8000" else: diff --git a/qa/L0_shared_memory/test.sh b/qa/L0_shared_memory/test.sh old mode 100644 new mode 100755 index b510688740..e30a7dffa7 --- a/qa/L0_shared_memory/test.sh +++ b/qa/L0_shared_memory/test.sh @@ -52,7 +52,7 @@ for i in \ test_unregisterall; do for client_type in http grpc; do SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1 ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$client_type.serverlog" + SERVER_LOG="./$i.$client_type.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" diff --git a/qa/L0_simple_ensemble/ensemble_test.py b/qa/L0_simple_ensemble/ensemble_test.py old mode 100644 new mode 100755 index b91097dfee..0b064c13e8 --- a/qa/L0_simple_ensemble/ensemble_test.py +++ b/qa/L0_simple_ensemble/ensemble_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,77 +27,82 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") sys.path.append("../clients") import logging - -import os import unittest -import numpy as np + import infer_util as iu +import numpy as np import test_util as tu import tritonhttpclient class EnsembleTest(tu.TestResultCollector): - def _get_infer_count_per_version(self, model_name): - triton_client = tritonhttpclient.InferenceServerClient("localhost:8000", - verbose=True) + triton_client = tritonhttpclient.InferenceServerClient( + "localhost:8000", verbose=True + ) stats = triton_client.get_inference_statistics(model_name) self.assertEqual(len(stats["model_stats"]), 2) infer_count = [0, 0] for model_stat in stats["model_stats"]: - self.assertEqual(model_stat["name"], model_name, - "expected stats for model " + model_name) - model_version = model_stat['version'] + self.assertEqual( + model_stat["name"], model_name, "expected stats for model " + model_name + ) + model_version = model_stat["version"] if model_version == "1": - infer_count[0] = model_stat["inference_stats"]["success"][ - "count"] + infer_count[0] = model_stat["inference_stats"]["success"]["count"] elif model_version == "2": - infer_count[1] = model_stat["inference_stats"]["success"][ - "count"] + infer_count[1] = model_stat["inference_stats"]["success"]["count"] else: self.assertTrue( - False, "unexpected version {} for model {}".format( - model_version, model_name)) + False, + "unexpected version {} for model {}".format( + model_version, model_name + ), + ) return infer_count def test_ensemble_add_sub(self): for bs in (1, 8): - iu.infer_exact(self, "ensemble_add_sub", (bs, 16), bs, np.int32, - np.int32, np.int32) + iu.infer_exact( + self, "ensemble_add_sub", (bs, 16), bs, np.int32, np.int32, np.int32 + ) infer_count = self._get_infer_count_per_version("simple") # The two 'simple' versions 
should have the same infer count - if (infer_count[0] != infer_count[1]): + if infer_count[0] != infer_count[1]: self.assertTrue( - False, - "unexpeced different infer count for different 'simple' versions" + False, "unexpeced different infer count for different 'simple' versions" ) def test_ensemble_add_sub_one_output(self): for bs in (1, 8): - iu.infer_exact(self, - "ensemble_add_sub", (bs, 16), - bs, - np.int32, - np.int32, - np.int32, - outputs=("OUTPUT0",)) + iu.infer_exact( + self, + "ensemble_add_sub", + (bs, 16), + bs, + np.int32, + np.int32, + np.int32, + outputs=("OUTPUT0",), + ) infer_count = self._get_infer_count_per_version("simple") # Only 'simple' version 2 should have non-zero infer count # as it is in charge of producing OUTPUT0 - if (infer_count[0] != 0): - self.assertTrue( - False, "unexpeced non-zero infer count for 'simple' version 1") - elif (infer_count[1] == 0): + if infer_count[0] != 0: self.assertTrue( - False, "unexpeced zero infer count for 'simple' version 2") + False, "unexpeced non-zero infer count for 'simple' version 1" + ) + elif infer_count[1] == 0: + self.assertTrue(False, "unexpeced zero infer count for 'simple' version 2") -if __name__ == '__main__': +if __name__ == "__main__": logging.basicConfig(stream=sys.stderr) unittest.main() diff --git a/qa/L0_simple_go_client/test.sh b/qa/L0_simple_go_client/test.sh old mode 100644 new mode 100755 index fcf7ed41b5..f09b79bfa2 --- a/qa/L0_simple_go_client/test.sh +++ b/qa/L0_simple_go_client/test.sh @@ -29,7 +29,8 @@ export CUDA_VISIBLE_DEVICES=0 TRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG:="main"} -SIMPLE_GO_CLIENT=grpc_simple_client.go +GO_CLIENT_DIR=client/src/grpc_generated/go +SIMPLE_GO_CLIENT=${GO_CLIENT_DIR}/grpc_simple_client.go SERVER=/opt/tritonserver/bin/tritonserver SERVER_ARGS=--model-repository=`pwd`/models @@ -47,28 +48,26 @@ fi RET=0 -# Fix to allow global stubs import -sed -i 's/.\/nvidia_inferenceserver/nvidia_inferenceserver/g' $SIMPLE_GO_CLIENT +# Generate Go stubs. +rm -fr client common +git clone https://github.com/triton-inference-server/client.git +go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest -PACKAGE_PATH="${GOPATH}/src" -mkdir -p ${PACKAGE_PATH} - -# Get the proto files from the common repo -rm -fr common +pushd ${GO_CLIENT_DIR} git clone --single-branch --depth=1 -b $TRITON_COMMON_REPO_TAG \ https://github.com/triton-inference-server/common.git -mkdir core && cp common/protobuf/*.proto core/. +bash gen_go_stubs.sh +popd -# Requires protoc and protoc-gen-go plugin: https://github.com/golang/protobuf#installation -# Use "M" arguments since go_package is not specified in .proto files. -# As mentioned here: https://developers.google.com/protocol-buffers/docs/reference/go-generated#package -GO_PACKAGE="nvidia_inferenceserver" -protoc -I core --go_out=plugins=grpc:${PACKAGE_PATH} --go_opt=Mgrpc_service.proto=./${GO_PACKAGE} \ - --go_opt=Mmodel_config.proto=./${GO_PACKAGE} core/*.proto +# Copy packages to GOPATH, where Go expects to find packages. +PACKAGE_PATH="${GOPATH}/src/github.com/triton-inference-server" +rm -rf ${PACKAGE_PATH}/client +mkdir -p ${PACKAGE_PATH} +cp -r client $PACKAGE_PATH set +e -# Runs test for GRPC variant of go client +# Run test for GRPC variant of go client GO111MODULE=off go run $SIMPLE_GO_CLIENT >>client.log 2>&1 if [ $? 
-ne 0 ]; then RET=1 diff --git a/qa/L0_simple_nodejs_client/test.sh b/qa/L0_simple_nodejs_client/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_socket/test.sh b/qa/L0_socket/test.sh old mode 100644 new mode 100755 index 257976ce96..2fd37bd054 --- a/qa/L0_socket/test.sh +++ b/qa/L0_socket/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -35,7 +35,7 @@ SERVER=/opt/tritonserver/bin/tritonserver SERVER_TIMEOUT=15 source ../common/util.sh -rm -f $CLIENT_LOG $SERVER_LOG +rm -f *.log RET=0 @@ -46,8 +46,8 @@ for address in default explicit; do SAME_EXPLICIT_ADDRESS="" DIFF_EXPLICIT_ADDRESS_ARGS="" else - SAME_EXPLICIT_ADDRESS="--http-address 127.0.0.1 --grpc-address 127.0.0.1" - DIFF_EXPLICIT_ADDRESS="--http-address 127.0.0.1 --grpc-address 127.0.0.2" + SAME_EXPLICIT_ADDRESS="--http-address 127.0.0.1 --grpc-address 127.0.0.1 --metrics-address 127.0.0.1" + DIFF_EXPLICIT_ADDRESS="--http-address 127.0.0.1 --grpc-address 127.0.0.2 --metrics-address 127.0.0.3" fi for p in http grpc; do @@ -138,7 +138,7 @@ for address in default explicit; do kill $SERVER_PID wait $SERVER_PID - # error if http/grpc port overlaps with grpc/http explicit port + # error if http/grpc port overlaps with grpc/http explicit port if [ "$p" == "http" ]; then SERVER_ARGS="--model-repository=$DATADIR $SAME_EXPLICIT_ADDRESS --http-port 8003 --grpc-port 8003" run_server_nowait @@ -302,6 +302,112 @@ for address in default explicit; do done done +# Test multiple servers binding to the same http/grpc port +SERVER0_LOG="./inference_server0.log" +SERVER1_LOG="./inference_server1.log" +SERVER2_LOG="./inference_server2.log" + +for p in http grpc; do + # error if servers bind to the same http/grpc port without setting the reuse flag + if [ "$p" == "http" ]; then + SERVER_ARGS="--model-repository=$DATADIR --metrics-port 8002 --reuse-grpc-port=true" + SERVER0_ARGS="--model-repository=$DATADIR --metrics-port 8003 --reuse-grpc-port=true" + SERVER1_ARGS="--model-repository=$DATADIR --metrics-port 8004 --reuse-grpc-port=true" + else + SERVER_ARGS="--model-repository=$DATADIR --metrics-port 8002 --reuse-http-port=true" + SERVER0_ARGS="--model-repository=$DATADIR --metrics-port 8003 --reuse-http-port=true" + SERVER1_ARGS="--model-repository=$DATADIR --metrics-port 8004 --reuse-http-port=true" + fi + # make sure the first server is launched successfully, then run the other + # two servers and expect them to fail + run_server + run_multiple_servers_nowait 2 + sleep 15 + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start SERVER $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + if [ "$SERVER1_PID" != "0" ]; then + set +e + kill $SERVER0_PID + wait $SERVER0_PID + if [ "$?" == "0" ]; then + echo -e "\n***\n*** unexpected start SERVER0 $SERVER\n***" + cat $SERVER0_LOG + exit 1 + fi + set -e + fi + if [ "$SERVER1_PID" != "0" ]; then + set +e + kill $SERVER1_PID + wait $SERVER1_PID + if [ "$?" == "0" ]; then + echo -e "\n***\n*** unexpected start SERVER1 $SERVER\n***" + cat $SERVER1_LOG + exit 1 + fi + set -e + fi + kill_server + + # 1. Allow multiple servers bind to the same http/grpc port with setting the reuse flag + # 2. 
Test different forms of setting --metrics-address and verify metrics are queryable + # (a) Test default metrics-address being same as http-address + # (b) Test setting metrics-address explicitly to 0.0.0.0 + # (c) Test setting metrics-address explicitly to 127.0.0.1 + SERVER0_ARGS="--model-repository=$DATADIR --metrics-port 8002 --reuse-http-port=true --reuse-grpc-port=true" + SERVER1_ARGS="--model-repository=$DATADIR --metrics-address 0.0.0.0 --metrics-port 8003 --reuse-http-port=true --reuse-grpc-port=true" + SERVER2_ARGS="--model-repository=$DATADIR --metrics-address 127.0.0.2 --metrics-port 8004 --reuse-http-port=true --reuse-grpc-port=true" + run_multiple_servers_nowait 3 + sleep 15 + if [ "$SERVER0_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start SERVER0 $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + if [ "$SERVER1_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start SERVER1 $SERVER\n***" + cat $SERVER1_LOG + exit 1 + fi + if [ "$SERVER2_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start SERVER2 $SERVER\n***" + cat $SERVER2_LOG + exit 1 + fi + + set +e + + # test if requests are being distributed among three servers + if [ "$p" == "http" ]; then + CLIENT_PY=../clients/simple_http_infer_client.py + else + CLIENT_PY=../clients/simple_grpc_infer_client.py + fi + + pids=() + for i in {0..10}; do + python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 & + pids+=" $!" + done + wait $pids || { echo -e "\n***\n*** Python ${p} Async Infer Test Failed\n***"; cat $CLIENT_LOG; RET=1; } + + set -e + + server0_request_count=`curl -s localhost:8002/metrics | awk '/nv_inference_request_success{/ {print $2}'` + server1_request_count=`curl -s localhost:8003/metrics | awk '/nv_inference_request_success{/ {print $2}'` + server2_request_count=`curl -s 127.0.0.2:8004/metrics | awk '/nv_inference_request_success{/ {print $2}'` + if [ ${server0_request_count%.*} -eq 0 ] || \ + [ ${server1_request_count%.*} -eq 0 ] || \ + [ ${server2_request_count%.*} -eq 0 ]; then + echo -e "\n***\n*** Failed: ${p} requests are not distributed among all servers.\n***" + RET=1 + fi + kill_servers +done + if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else diff --git a/qa/L0_storage_S3/infer_test.py b/qa/L0_storage_S3/infer_test.py deleted file mode 100644 index 9933809b6d..0000000000 --- a/qa/L0_storage_S3/infer_test.py +++ /dev/null @@ -1,174 +0,0 @@ -# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions -# are met: -# * Redistributions of source code must retain the above copyright -# notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and/or other materials provided with the distribution. -# * Neither the name of NVIDIA CORPORATION nor the names of its -# contributors may be used to endorse or promote products derived -# from this software without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY -# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR -# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR -# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY -# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE -# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - -import sys -sys.path.append("../common") - -from builtins import range -from future.utils import iteritems -import unittest -import numpy as np -import infer_util as iu -import test_util as tu -import os - -np_dtype_string = np.dtype(object) - - -class InferTest(tu.TestResultCollector): - - def _full_exact(self, input_dtype, output0_dtype, output1_dtype, - output0_raw, output1_raw, swap): - - def _infer_exact_helper(tester, - pf, - tensor_shape, - batch_size, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=True, - output1_raw=True, - model_version=None, - swap=False, - outputs=("OUTPUT0", "OUTPUT1"), - use_http=True, - use_grpc=True, - skip_request_id_check=False, - use_streaming=True, - correlation_id=0): - for bs in (1, batch_size): - iu.infer_exact(tester, - pf, (bs,) + tensor_shape, - bs, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - model_version=model_version, - swap=swap, - outputs=outputs, - use_http=use_http, - use_grpc=use_grpc, - skip_request_id_check=skip_request_id_check, - use_streaming=use_streaming, - correlation_id=correlation_id) - - input_size = 16 - - if tu.validate_for_tf_model(input_dtype, output0_dtype, output1_dtype, - (input_size,), (input_size,), - (input_size,)): - for pf in ["graphdef", "savedmodel"]: - _infer_exact_helper(self, - pf, (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_trt_model(input_dtype, output0_dtype, output1_dtype, - (input_size, 1, 1), (input_size, 1, 1), - (input_size, 1, 1)): - if input_dtype == np.int8: - _infer_exact_helper(self, - 'plan', (input_size, 1, 1), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - else: - _infer_exact_helper(self, - 'plan', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_onnx_model(input_dtype, output0_dtype, output1_dtype, - (input_size,), (input_size,), - (input_size,)): - _infer_exact_helper(self, - 'onnx', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - # Skip for batched string I/O - if tu.validate_for_libtorch_model(input_dtype, output0_dtype, - output1_dtype, (input_size,), - (input_size,), (input_size,), 8): - _infer_exact_helper(self, - 'libtorch', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - def test_raw_fff(self): - self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=True, - output1_raw=True, - swap=True) - - def test_raw_ooo(self): - self._full_exact(np_dtype_string, - np_dtype_string, - np_dtype_string, - output0_raw=True, - output1_raw=True, - swap=False) - - def test_class_fff(self): - 
self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=False, - output1_raw=False, - swap=True) - - -if __name__ == '__main__': - unittest.main() diff --git a/qa/L0_storage_S3/test.sh b/qa/L0_storage_S3/test.sh index 5fe4315dd5..51c8b2ce1e 100755 --- a/qa/L0_storage_S3/test.sh +++ b/qa/L0_storage_S3/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2018-2021, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2018-2023, NVIDIA CORPORATION. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -42,7 +42,7 @@ fi export CUDA_VISIBLE_DEVICES=0 CLIENT_LOG_BASE="./client" -INFER_TEST=infer_test.py +INFER_TEST="../common/infer_test.py" EXPECTED_NUM_TESTS="3" TEST_RESULT_FILE='test_results.txt' @@ -65,6 +65,11 @@ aws s3 mb "${BUCKET_URL}" BUCKET_URL=${BUCKET_URL%/} BUCKET_URL_SLASH="${BUCKET_URL}/" +# Backup S3 credentials as they will be unset during the test +AWS_DEFAULT_REGION_BACKUP=$AWS_DEFAULT_REGION +AWS_ACCESS_KEY_ID_BACKUP=$AWS_ACCESS_KEY_ID +AWS_SECRET_ACCESS_KEY_BACKUP=$AWS_SECRET_ACCESS_KEY + SERVER=/opt/tritonserver/bin/tritonserver SERVER_TIMEOUT=420 @@ -77,7 +82,7 @@ RET=0 # Test 3 Scenarios: # 1. Only AWS ENV vars (Without aws configure) # 2. AWS ENV vars + dummy values in aws configure [ENV vars have higher priority] -# 3. Only aws configure (Without AWS ENV vars) +# 3. Only AWS configured (Without AWS ENV vars) for ENV_VAR in "env" "env_dummy" "config"; do SERVER_LOG=$SERVER_LOG_BASE.$ENV_VAR.log CLIENT_LOG=$CLIENT_LOG_BASE.$ENV_VAR.log @@ -242,6 +247,15 @@ for ENV_VAR in "env" "env_dummy" "config"; do done done +# Restore S3 credentials +rm ~/.aws/credentials && rm ~/.aws/config +export AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION_BACKUP +export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID_BACKUP +export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY_BACKUP +aws configure set default.region $AWS_DEFAULT_REGION && \ + aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID && \ + aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY + # Test with polling enabled SERVER_ARGS="--model-repository=$ROOT_REPO --exit-timeout-secs=120 --model-control-mode=poll" @@ -278,6 +292,48 @@ set -e kill $SERVER_PID wait $SERVER_PID +# Test localization to a specified location +export TRITON_AWS_MOUNT_DIRECTORY=`pwd`/aws_localization_test + +if [ -d "$TRITON_AWS_MOUNT_DIRECTORY" ]; then + rm -rf $TRITON_AWS_MOUNT_DIRECTORY +fi + +mkdir -p $TRITON_AWS_MOUNT_DIRECTORY + +SERVER_LOG=$SERVER_LOG_BASE.custom_localization.log +SERVER_ARGS="--model-repository=$ROOT_REPO --exit-timeout-secs=120" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +if [ -z "$(ls -A $TRITON_AWS_MOUNT_DIRECTORY)" ]; then + echo -e "\n***\n*** Test localization to a specified location failed. \n***" + echo -e "\n***\n*** Specified mount folder $TRITON_AWS_MOUNT_DIRECTORY is empty \n***" + ls -A $TRITON_AWS_MOUNT_DIRECTORY + exit 1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +if [ -d "$TRITON_AWS_MOUNT_DIRECTORY" ] && [ ! -z "$(ls -A $TRITON_AWS_MOUNT_DIRECTORY)" ]; then + echo -e "\n***\n*** Test localization to a specified location failed. \n***" + echo -e "\n***\n*** Specified mount folder $TRITON_AWS_MOUNT_DIRECTORY was not cleared properly. 
\n***" + ls -A $TRITON_AWS_MOUNT_DIRECTORY + exit 1 +fi + +rm -rf $TRITON_AWS_MOUNT_DIRECTORY +unset TRITON_AWS_MOUNT_DIRECTORY + +# Save models for AWS_SESSION_TOKEN test +rm -rf tmp_cred_test_models +mv models tmp_cred_test_models # Clean up bucket contents aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" @@ -323,6 +379,143 @@ fi kill $SERVER_PID wait $SERVER_PID +# Clean up bucket contents +aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" + +# Test with temporary credential (AWS_SESSION_TOKEN) +AWS_GET_SESSION_TOKEN_RES=`aws sts get-session-token --duration-seconds 900` && \ + export AWS_ACCESS_KEY_ID=`echo $AWS_GET_SESSION_TOKEN_RES | jq -r ".Credentials.AccessKeyId"` && \ + export AWS_SECRET_ACCESS_KEY=`echo $AWS_GET_SESSION_TOKEN_RES | jq -r ".Credentials.SecretAccessKey"` && \ + export AWS_SESSION_TOKEN=`echo $AWS_GET_SESSION_TOKEN_RES | jq -r ".Credentials.SessionToken"` +rm ~/.aws/credentials && rm ~/.aws/config +aws configure set default.region $AWS_DEFAULT_REGION && \ + aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID && \ + aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY && \ + aws configure set aws_session_token $AWS_SESSION_TOKEN + +# Copy models into S3 bucket +aws s3 cp tmp_cred_test_models/ "${BUCKET_URL_SLASH}" --recursive --include "*" + +SERVER_LOG=$SERVER_LOG_BASE.temporary_credentials_test.log +SERVER_ARGS="--model-repository=$BUCKET_URL --exit-timeout-secs=120" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e + +python $INFER_TEST >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi + +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Test access decline +export AWS_SECRET_ACCESS_KEY="[Invalid]" && export AWS_SESSION_TOKEN="" +SERVER_LOG=$SERVER_LOG_BASE.access_decline_test.log +SERVER_ARGS="--model-repository=$BUCKET_URL --exit-timeout-secs=120" +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "\n***\n*** Unexpected server start $SERVER\n***" + cat $SERVER_LOG + kill $SERVER_PID + wait $SERVER_PID + RET=1 +else + # AWS S3 does not appear to reply on access decline, but other implementations + # might provide extra messages, so make sure Triton will print the messages. + EXPECTED_MSG="Unable to create S3 filesystem client. Check account credentials. Exception: '' Message: 'No response body.'" + if ! 
grep "$EXPECTED_MSG" $SERVER_LOG; then + echo -e "\n***\n*** Expected error message not found\n***" + cat $SERVER_LOG + RET=1 + fi +fi + +# Restore S3 credentials +rm ~/.aws/credentials && rm ~/.aws/config +export AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION_BACKUP +export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID_BACKUP +export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY_BACKUP +aws configure set default.region $AWS_DEFAULT_REGION && \ + aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID && \ + aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY + +# Clean up bucket contents +aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" + +# Test case where S3 folder has >1000 files +rm -rf models + +mkdir -p models/model/1 +# Create Python model that reads the number of files in the +# model directory when loaded +echo "import os + +class TritonPythonModel: + + def initialize(self, args): + count = 0 + model_dir = args['model_repository'] + for path in os.listdir(model_dir): + if os.path.isfile(os.path.join(model_dir, path)): + count += 1 + print('Found {} files in model directory'.format(count)) + + def execute(self): + pass" > models/model/1/model.py + +for i in {1..1050}; do + touch models/model/0${i}.txt +done + +# Provide extended timeout to allow >1000 files to be loaded +SERVER_ARGS="--model-repository=$BUCKET_URL --exit-timeout-secs=600 --model-control-mode=none" +SERVER_LOG=$SERVER_LOG_BASE.many_files.log + +# copy contents of /models into S3 bucket and wait for them to be loaded. +aws s3 cp models/ "${BUCKET_URL_SLASH}" --recursive --include "*" + +# Test that the server starts up. Files will be loaded in numerically +# ascending order, so the model file is loaded after the first 1000 +# files. If AWS fails to load >1000 files, the model file will not +# be loaded and the server will fail to start. + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +# Confirm the correct number of files loaded +EXPECTED_MSG="Found 1050 files in model directory" +if ! grep "$EXPECTED_MSG" $SERVER_LOG; then +echo -e "\n***\n*** Expected file count message not found\n***" +cat $SERVER_LOG +RET=1 +fi + # Clean up bucket contents and delete bucket aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" aws s3 rb "${BUCKET_URL}" diff --git a/qa/L0_storage_S3_local/mock_s3_service.py b/qa/L0_storage_S3_local/mock_s3_service.py new file mode 100755 index 0000000000..956aac0e66 --- /dev/null +++ b/qa/L0_storage_S3_local/mock_s3_service.py @@ -0,0 +1,113 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import threading +import time +from http.server import BaseHTTPRequestHandler, HTTPServer + + +class MockS3Service: + __address = "localhost" + __port = 8080 + + def __init__(self): + # Test passed when: + # - at least one HEAD request is received; and + # - at least one GET request is received; and + # - all received requests do not advertise for HTTP/2. + test_results = {"head_count": 0, "get_count": 0, "http2_ads": False} + + class RequestValidator(BaseHTTPRequestHandler): + protocol_version = "HTTP/1.1" + + def __CheckHttp2Ads(self): + if "connection" in self.headers: + v = self.headers["connection"].lower() + if "upgrade" in v or "http2" in v: + test_results["http2_ads"] = True + if ( + "upgrade" in self.headers + and "h2c" in self.headers["upgrade"].lower() + ): + test_results["http2_ads"] = True + if "http2-settings" in self.headers: + test_results["http2_ads"] = True + + def do_HEAD(self): + self.__CheckHttp2Ads() + test_results["head_count"] += 1 + self.send_response(200) + self.end_headers() + + def do_GET(self): + self.__CheckHttp2Ads() + test_results["get_count"] += 1 + self.send_error( + 404, + "Thank you for using the mock s3 service!", + "Your bucket is not found here!", + ) + + self.__test_results = test_results + self.__server = HTTPServer((self.__address, self.__port), RequestValidator) + self.__service_thread = threading.Thread(target=self.__server.serve_forever) + + def __enter__(self): + self.__service_thread.start() + + def __exit__(self, exc_type, exc_val, exc_tb): + self.__server.shutdown() + self.__server.server_close() + self.__service_thread.join() + + def TestPassed(self): + return ( + self.__test_results["head_count"] > 0 + and self.__test_results["get_count"] > 0 + and not self.__test_results["http2_ads"] + ) + + +if __name__ == "__main__": + # Initialize mock service + mock_s3_service = MockS3Service() + + # Start service and poll until test passed or timed-out + with mock_s3_service: + poll_interval = 1 # seconds + timeout = 10 # seconds + elapsed_time = 0 # seconds + while not mock_s3_service.TestPassed() and elapsed_time < timeout: + elapsed_time += poll_interval + time.sleep(poll_interval) + + # Print the result + if mock_s3_service.TestPassed(): + print("TEST PASSED") + else: + print("TEST FAILED") diff --git a/qa/L0_s3_local/test.sh b/qa/L0_storage_S3_local/test.sh old mode 100644 new mode 100755 similarity index 64% rename from qa/L0_s3_local/test.sh rename to qa/L0_storage_S3_local/test.sh index eee5495971..e60b106b31 --- a/qa/L0_s3_local/test.sh +++ b/qa/L0_storage_S3_local/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -41,17 +41,63 @@ fi export CUDA_VISIBLE_DEVICES=0 CLIENT_LOG="./client.log" -PERF_CLIENT=../clients/perf_client +TEST_RESULT_FILE='test_results.txt' +INFER_TEST="../common/infer_test.py" +EXPECTED_NUM_TESTS="3" DATADIR="/data/inferenceserver/${REPO_VERSION}/qa_model_repository" -BACKENDS="graphdef libtorch onnx plan savedmodel" +# Used to control which backends are run in infer_test.py +BACKENDS=${BACKENDS:="graphdef savedmodel onnx libtorch plan"} -rm -rf models && mkdir models -for BACKEND in $BACKENDS; do - cp -r $DATADIR/${BACKEND}_float32_float32_float32 models/. - # Remove version policy from config.pbtxt - sed -i '/^version_policy/d' models/${BACKEND}_float32_float32_float32/config.pbtxt -done +function run_unit_tests() { + echo "Running unit tests: ${INFER_TEST}" + python $INFER_TEST >$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi +} + +function setup_model_repo() { + model_repo=${1:-"models"} + backends=${2:-${BACKENDS}} + types=${3:-"float32_float32_float32 object_object_object"} + echo "[setup_model_repo] model_repo: ${model_repo}, backends: ${backends}" + rm -rf ${model_repo} && mkdir ${model_repo} + for BACKEND in ${backends}; do + for TYPE in ${types}; do + model="${BACKEND}_${TYPE}" + echo "Copying ${DATADIR}/${model} to ${model_repo}." + cp -r "${DATADIR}/${model}" "${model_repo}/" + # Remove version policy from config.pbtxt + sed -i '/^version_policy/d' ${model_repo}/${model}/config.pbtxt + done + done +} + +function load_models() { + model_repo=${1:-"models"} + for model in `ls ${model_repo}`; do + echo "Loading model: ${model}" + code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${model}/load` + if [ "$code" != "200" ]; then + echo -e "\n***\n*** Test Failed. Failed to load model: ${model}\n***" + RET=1 + fi + done +} + +set +e +setup_model_repo +set -e # Create model with name that has all types of allowed characters DUMMY_MODEL="Model_repo-1.0" @@ -75,7 +121,7 @@ export MINIO_ACCESS_KEY="minio" # https://github.com/minio/minio/issues/15030 export MINIO_CI_CD=true MINIO_VOLUMES="/usr/local/share/minio/" -MINIO_OPTS="-C /etc/minio --address localhost:4572" +MINIO_OPTS="-C /etc/minio --address 127.0.0.1:4572" export MINIO_SECRET_KEY="miniostorage" (curl -O https://raw.githubusercontent.com/minio/minio-service/master/linux-systemd/minio.service && \ @@ -105,6 +151,7 @@ awslocal $ENDPOINT_FLAG s3 mb s3://demo-bucket1.0 && \ RET=0 # Test with hostname and IP address +echo "=== Running hostname/IP tests ===" for HOST in "127.0.0.1" "localhost"; do SERVER_ARGS="--model-repository=s3://$HOST:4572/demo-bucket1.0 --model-control-mode=explicit" if [ "$HOST" = "127.0.0.1" ]; then @@ -124,20 +171,8 @@ for HOST in "127.0.0.1" "localhost"; do fi set +e - for BACKEND in $BACKENDS; do - code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${BACKEND}_float32_float32_float32/load` - if [ "$code" != "200" ]; then - echo -e "\n***\n*** Test Failed\n***" - RET=1 - fi - - $PERF_CLIENT -m ${BACKEND}_float32_float32_float32 -p 3000 -t 1 >$CLIENT_LOG 2>&1 - if [ $? 
-ne 0 ]; then - echo -e "\n***\n*** Test Failed\n***" - cat $CLIENT_LOG - RET=1 - fi - done + load_models + run_unit_tests # Try to load model with name that checks for all types of allowed characters code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${DUMMY_MODEL}/load` @@ -152,6 +187,7 @@ for HOST in "127.0.0.1" "localhost"; do done # Test with Polling +echo "=== Running polling tests ===" SERVER_ARGS="--model-repository=s3://localhost:4572/demo-bucket1.0 --model-control-mode=poll" SERVER_LOG="./inference_server_poll.log" @@ -170,7 +206,7 @@ awslocal $ENDPOINT_FLAG s3 sync models s3://demo-bucket1.0 sleep 20 -set + e +set +e CURL_LOG=$(curl -X POST localhost:8000/v2/repository/index) if [[ "$CURL_LOG" != *"{\"name\":\"libtorch_float32_float32_float32\",\"version\":\"3\",\"state\":\"UNAVAILABLE\",\"reason\":\"unloaded\"}"* ]]; then echo -e "\n***\n*** Failed. Server did not unload libtorch_float32_float32_float32 version 3\n***" @@ -191,9 +227,26 @@ awslocal $ENDPOINT_FLAG s3 rm s3://demo-bucket1.0 --recursive --include "*" && \ awslocal $ENDPOINT_FLAG s3 rb s3://demo-bucket1.0 # Test with Polling, no model configuration file - with strict model config disabled -rm -rf models && mkdir models -cp -r $DATADIR/savedmodel_float32_float32_float32 models/. -rm models/savedmodel_float32_float32_float32/config.pbtxt +echo "=== Running autocomplete tests ===" +AUTOCOMPLETE_BACKENDS="savedmodel" +export BACKENDS=${AUTOCOMPLETE_BACKENDS} + +set +e +setup_model_repo + +TYPES="float32_float32_float32 object_object_object" +for BACKEND in ${AUTOCOMPLETE_BACKENDS}; do + for TYPE in ${TYPES}; do + model="${BACKEND}_${TYPE}" + # Config files specify things expected by unit test like label_filename + # and max_batch_size for comparing results, so remove some key fields + # for autocomplete to fill that won't break the unit test. + sed -i '/platform:/d' models/${model}/config.pbtxt + sed -i '/data_type:/d' models/${model}/config.pbtxt + sed -i '/dims:/d' models/${model}/config.pbtxt + done +done +set -e awslocal $ENDPOINT_FLAG s3 mb s3://demo-bucket1.0 && \ awslocal $ENDPOINT_FLAG s3 sync models s3://demo-bucket1.0 @@ -211,12 +264,7 @@ if [ "$SERVER_PID" == "0" ]; then exit 1 fi -$PERF_CLIENT -m savedmodel_float32_float32_float32 -p 3000 -t 1 > $CLIENT_LOG 2>&1 -if [ $? -ne 0 ]; then - echo -e "\n***\n*** Test Failed\n***" - cat $CLIENT_LOG - RET=1 -fi +run_unit_tests kill $SERVER_PID wait $SERVER_PID @@ -226,23 +274,15 @@ awslocal $ENDPOINT_FLAG s3 rm s3://demo-bucket1.0 --recursive --include "*" && \ awslocal $ENDPOINT_FLAG s3 rb s3://demo-bucket1.0 # Test for multiple model repositories using S3 cloud storage +echo "=== Running multiple-model-repository tests ===" BACKENDS1="graphdef libtorch" BACKENDS2="onnx plan savedmodel" -BACKENDS="$BACKENDS1 $BACKENDS2" - -rm -rf models1 && mkdir models1 -for BACKEND in $BACKENDS1; do - cp -r $DATADIR/${BACKEND}_float32_float32_float32 models1/. - # Remove version policy from config.pbtxt - sed -i '/^version_policy/d' models1/${BACKEND}_float32_float32_float32/config.pbtxt -done +export BACKENDS="$BACKENDS1 $BACKENDS2" -rm -rf models2 && mkdir models2 -for BACKEND in $BACKENDS2; do - cp -r $DATADIR/${BACKEND}_float32_float32_float32 models2/. 
- # Remove version policy from config.pbtxt - sed -i '/^version_policy/d' models2/${BACKEND}_float32_float32_float32/config.pbtxt -done +set +e +setup_model_repo "models1" "${BACKENDS1}" +setup_model_repo "models2" "${BACKENDS2}" +set -e BUCKET_NAME="demo-bucket" MODEL_REPO_ARGS="" @@ -272,25 +312,39 @@ if [ "$SERVER_PID" == "0" ]; then fi set +e -for BACKEND in $BACKENDS; do - code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${BACKEND}_float32_float32_float32/load` - if [ "$code" != "200" ]; then - echo -e "\n***\n*** Test Failed\n***" - RET=1 - fi - - $PERF_CLIENT -m ${BACKEND}_float32_float32_float32 -p 3000 -t 1 >$CLIENT_LOG 2>&1 - if [ $? -ne 0 ]; then - echo -e "\n***\n*** Test Failed\n***" - cat $CLIENT_LOG - RET=1 - fi -done +load_models "models1" +load_models "models2" +run_unit_tests set -e kill $SERVER_PID wait $SERVER_PID +# Test access decline +AWS_SECRET_ACCESS_KEY_BACKUP=$AWS_SECRET_ACCESS_KEY +export AWS_SECRET_ACCESS_KEY="[Invalid]" +SERVER_ARGS="--model-repository=s3://localhost:4572/${BUCKET_NAME}1 --exit-timeout-secs=120" +SERVER_LOG="./inference_server.access_decline.log" +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "\n***\n*** Unexpected server start $SERVER\n***" + cat $SERVER_LOG + kill $SERVER_PID + wait $SERVER_PID + RET=1 +else + # MinIO does not appear to reply on access decline, but other implementations + # might provide extra messages, so make sure Triton will print the messages. + EXPECTED_MSG="Unable to create S3 filesystem client. Check account credentials. Exception: '' Message: 'No response body.'" + if ! grep "$EXPECTED_MSG" $SERVER_LOG; then + echo -e "\n***\n*** Expected error message not found\n***" + cat $SERVER_LOG + RET=1 + fi +fi +# Restore keys for destroying buckets +export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY_BACKUP + # Destroy buckets for BUCKET_SUFFIX in 1 2; do awslocal $ENDPOINT_FLAG s3 rm s3://$BUCKET_NAME$BUCKET_SUFFIX --recursive --include "*" && \ @@ -301,10 +355,33 @@ done kill $MINIO_PID wait $MINIO_PID -if [ $RET -eq 0 ]; then - echo -e "\n***\n*** Test Passed\n***" +# Test the S3 client will not advertise HTTP/2 +TEST_LOG="./http2_advertise_test.log" +python3 mock_s3_service.py > $TEST_LOG 2>&1 & +sleep 2 # make sure the mock service has started +SERVER_LOG="./http2_advertise_test.server.log" +SERVER_ARGS="--model-repository=s3://localhost:8080/dummy-bucket --exit-timeout-secs=120" +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "\n***\n*** Unexpected server start $SERVER\n***" + cat $SERVER_LOG + kill $SERVER_PID + wait $SERVER_PID + RET=1 else - echo -e "\n***\n*** Test Failed\n***" + sleep 2 # make sure the mock service has stopped + PASSED_MSG="TEST PASSED" + if ! grep "$PASSED_MSG" $TEST_LOG; then + echo -e "\n***\n*** S3 client HTTP/2 advertise test failed\n***" + cat $TEST_LOG + RET=1 + fi fi +# Print and return test result +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test Failed\n***" +fi exit $RET diff --git a/qa/L0_storage_azure/infer_test.py b/qa/L0_storage_azure/infer_test.py deleted file mode 100644 index 372adb2132..0000000000 --- a/qa/L0_storage_azure/infer_test.py +++ /dev/null @@ -1,174 +0,0 @@ -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions -# are met: -# * Redistributions of source code must retain the above copyright -# notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and/or other materials provided with the distribution. -# * Neither the name of NVIDIA CORPORATION nor the names of its -# contributors may be used to endorse or promote products derived -# from this software without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY -# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR -# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR -# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY -# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE -# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - -import sys -sys.path.append("../common") - -from builtins import range -from future.utils import iteritems -import unittest -import numpy as np -import infer_util as iu -import test_util as tu -import os - -np_dtype_string = np.dtype(object) - - -class InferTest(tu.TestResultCollector): - - def _full_exact(self, input_dtype, output0_dtype, output1_dtype, - output0_raw, output1_raw, swap): - - def _infer_exact_helper(tester, - pf, - tensor_shape, - batch_size, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=True, - output1_raw=True, - model_version=None, - swap=False, - outputs=("OUTPUT0", "OUTPUT1"), - use_http=True, - use_grpc=True, - skip_request_id_check=False, - use_streaming=True, - correlation_id=0): - for bs in (1, batch_size): - iu.infer_exact(tester, - pf, (bs,) + tensor_shape, - bs, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - model_version=model_version, - swap=swap, - outputs=outputs, - use_http=use_http, - use_grpc=use_grpc, - skip_request_id_check=skip_request_id_check, - use_streaming=use_streaming, - correlation_id=correlation_id) - - input_size = 16 - - if tu.validate_for_tf_model(input_dtype, output0_dtype, output1_dtype, - (input_size,), (input_size,), - (input_size,)): - for pf in ["graphdef", "savedmodel"]: - _infer_exact_helper(self, - pf, (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_trt_model(input_dtype, output0_dtype, output1_dtype, - (input_size, 1, 1), (input_size, 1, 1), - (input_size, 1, 1)): - if input_dtype == np.int8: - _infer_exact_helper(self, - 'plan', (input_size, 1, 1), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - else: - _infer_exact_helper(self, - 'plan', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_onnx_model(input_dtype, 
output0_dtype, output1_dtype, - (input_size,), (input_size,), - (input_size,)): - _infer_exact_helper(self, - 'onnx', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - # Skip for batched string I/O - if tu.validate_for_libtorch_model(input_dtype, output0_dtype, - output1_dtype, (input_size,), - (input_size,), (input_size,), 8): - _infer_exact_helper(self, - 'libtorch', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - def test_raw_fff(self): - self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=True, - output1_raw=True, - swap=True) - - def test_raw_ooo(self): - self._full_exact(np_dtype_string, - np_dtype_string, - np_dtype_string, - output0_raw=True, - output1_raw=True, - swap=False) - - def test_class_fff(self): - self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=False, - output1_raw=False, - swap=True) - - -if __name__ == '__main__': - unittest.main() diff --git a/qa/L0_storage_azure/test.sh b/qa/L0_storage_azure/test.sh index 0bc44c9a60..15f9c78bcc 100755 --- a/qa/L0_storage_azure/test.sh +++ b/qa/L0_storage_azure/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -55,9 +55,8 @@ ACCOUNT_NAME=$AZURE_STORAGE_ACCOUNT ACCOUNT_KEY=$AZURE_STORAGE_KEY export CUDA_VISIBLE_DEVICES=0 CLIENT_LOG_BASE="./client" -INFER_TEST=infer_test.py +INFER_TEST="../common/infer_test.py" EXPECTED_NUM_TESTS="3" -PERF_CLIENT=../clients/perf_client timestamp=$(date +%s) CONTAINER_NAME="tritonqatest${timestamp}" @@ -82,19 +81,38 @@ source ../common/util.sh rm -f $SERVER_LOG_BASE* $CLIENT_LOG_BASE* RET=0 -BACKENDS="graphdef savedmodel onnx libtorch plan" +# Used to control which backends are run in infer_test.py +BACKENDS=${BACKENDS:="graphdef savedmodel onnx libtorch plan"} -# Construct model repository -mkdir -p models -for FW in $BACKENDS; do - cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/${FW}_float32_float32_float32 models/ -done +function run_unit_tests() { + BACKENDS=$BACKENDS python $INFER_TEST >$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? 
-ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi +} -# Copy models with string inputs and remove nobatch (bs=1) models -cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/*_object_object_object models/ +function setup_model_repo() { + # Construct model repository + rm -rf models && mkdir -p models + for FW in $BACKENDS; do + cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/${FW}_float32_float32_float32 models/ + done -rm -rf models/*nobatch* + # Copy models with string inputs and remove nobatch (bs=1) models + cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/*_object_object_object models/ + rm -rf models/*nobatch* +} +setup_model_repo KIND="KIND_GPU" for FW in $BACKENDS; do for MC in `ls models/${FW}*/config.pbtxt`; do @@ -144,27 +162,52 @@ for ENV_VAR in "shared_key"; do fi set +e - - python $INFER_TEST >$CLIENT_LOG 2>&1 - if [ $? -ne 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Failed\n***" - RET=1 - else - check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS - if [ $? -ne 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Result Verification Failed\n***" - RET=1 - fi - fi - + run_unit_tests set -e kill $SERVER_PID wait $SERVER_PID done +# Test localization to a specified location +export TRITON_AZURE_MOUNT_DIRECTORY=`pwd`/azure_localization_test + +if [ -d "$TRITON_AZURE_MOUNT_DIRECTORY" ]; then + rm -rf $TRITON_AZURE_MOUNT_DIRECTORY +fi + +mkdir -p $TRITON_AZURE_MOUNT_DIRECTORY + +SERVER_LOG=$SERVER_LOG_BASE.custom_localization.log +SERVER_ARGS="--model-repository=$MODEL_REPO --exit-timeout-secs=120" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +if [ -z "$(ls -A $TRITON_AZURE_MOUNT_DIRECTORY)" ]; then + echo -e "\n***\n*** Test localization to a specified location failed. \n***" + echo -e "\n***\n*** Specified mount folder $TRITON_AZURE_MOUNT_DIRECTORY is empty \n***" + ls -A $TRITON_AZURE_MOUNT_DIRECTORY + exit 1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +if [ -d "$TRITON_AZURE_MOUNT_DIRECTORY" ] && [ ! -z "$(ls -A $TRITON_AZURE_MOUNT_DIRECTORY)" ]; then + echo -e "\n***\n*** Test localization to a specified location failed. \n***" + echo -e "\n***\n*** Specified mount folder $TRITON_AZURE_MOUNT_DIRECTORY was not cleared properly. \n***" + ls -A $TRITON_AZURE_MOUNT_DIRECTORY + exit 1 +fi + +rm -rf $TRITON_AZURE_MOUNT_DIRECTORY +unset TRITON_AZURE_MOUNT_DIRECTORY + # Add test for explicit model control SERVER_LOG=$SERVER_LOG_BASE.explicit.log CLIENT_LOG=$CLIENT_LOG_BASE.explicit.log @@ -179,20 +222,16 @@ if [ "$SERVER_PID" == "0" ]; then fi set +e -for BACKEND in $BACKENDS; do - code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${BACKEND}_float32_float32_float32/load` +for model in `ls models/`; do + code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${model}/load` if [ "$code" != "200" ]; then echo -e "\n***\n*** Test Failed\n***" RET=1 fi - - $PERF_CLIENT -m ${BACKEND}_float32_float32_float32 -p 3000 -t 1 >$CLIENT_LOG 2>&1 - if [ $? 
-ne 0 ]; then - echo -e "\n***\n*** Test Failed\n***" - cat $CLIENT_LOG - RET=1 - fi done + +# Check that each explicitly loaded model runs correctly +run_unit_tests set -e kill $SERVER_PID @@ -211,9 +250,20 @@ SERVER_ARGS="--model-repository=${AS_URL}/models --model-control-mode=poll --str az storage container create --name ${CONTAINER_NAME} --account-name ${ACCOUNT_NAME} --account-key ${ACCOUNT_KEY} sleep 10 +# Setup model repository with minimal configs to be autocompleted rm -rf models && mkdir -p models -cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/savedmodel_float32_float32_float32 models/ -rm models/savedmodel_float32_float32_float32/config.pbtxt +AUTOCOMPLETE_BACKENDS="savedmodel" +for FW in ${AUTOCOMPLETE_BACKENDS}; do + for model in ${FW}_float32_float32_float32 ${FW}_object_object_object; do + cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/${model} models/ + # Config files specify things expected by unit test like label_filename + # and max_batch_size for comparing results, so remove some key fields + # for autocomplete to fill that won't break the unit test. + sed -i '/platform:/d' models/${model}/config.pbtxt + sed -i '/data_type:/d' models/${model}/config.pbtxt + sed -i '/dims:/d' models/${model}/config.pbtxt + done +done # copy contents of models into container. for file in `find models -type f` ;do @@ -229,12 +279,9 @@ if [ "$SERVER_PID" == "0" ]; then fi set +e -$PERF_CLIENT -m savedmodel_float32_float32_float32 -p 3000 -t 1 >$CLIENT_LOG 2>&1 -if [ $? -ne 0 ]; then - echo -e "\n***\n*** Test Failed\n***" - cat $CLIENT_LOG - RET=1 -fi +# Check that each polled model runs correctly +export BACKENDS="${AUTOCOMPLETE_BACKENDS}" +run_unit_tests set -e kill $SERVER_PID diff --git a/qa/L0_storage_swiftstack/infer_test.py b/qa/L0_storage_swiftstack/infer_test.py old mode 100644 new mode 100755 index db2499782c..f8a65a01a4 --- a/qa/L0_storage_swiftstack/infer_test.py +++ b/qa/L0_storage_swiftstack/infer_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,139 +27,181 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
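# The test cases below drive infer_util.infer_exact against the models used by
# the SwiftStack storage test, covering each backend that validates for the
# requested datatype combination (TensorFlow graphdef/savedmodel, TensorRT
# plan, ONNX and libtorch) over HTTP, gRPC and streaming, with both raw and
# classification outputs.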
import sys + sys.path.append("../common") -from builtins import range -from future.utils import iteritems import unittest -import numpy as np + import infer_util as iu +import numpy as np import test_util as tu -import os class InferTest(tu.TestResultCollector): - - def _full_exact(self, input_dtype, output0_dtype, output1_dtype, - output0_raw, output1_raw, swap): - - def _infer_exact_helper(tester, - pf, - tensor_shape, - batch_size, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=True, - output1_raw=True, - model_version=None, - swap=False, - outputs=("OUTPUT0", "OUTPUT1"), - use_http=True, - use_grpc=True, - skip_request_id_check=False, - use_streaming=True, - correlation_id=0): + def _full_exact( + self, input_dtype, output0_dtype, output1_dtype, output0_raw, output1_raw, swap + ): + def _infer_exact_helper( + tester, + pf, + tensor_shape, + batch_size, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=True, + output1_raw=True, + model_version=None, + swap=False, + outputs=("OUTPUT0", "OUTPUT1"), + use_http=True, + use_grpc=True, + skip_request_id_check=False, + use_streaming=True, + correlation_id=0, + ): for bs in (1, batch_size): - iu.infer_exact(tester, - pf, (bs,) + tensor_shape, - bs, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - model_version=model_version, - swap=swap, - outputs=outputs, - use_http=use_http, - use_grpc=use_grpc, - skip_request_id_check=skip_request_id_check, - use_streaming=use_streaming, - correlation_id=correlation_id) + iu.infer_exact( + tester, + pf, + (bs,) + tensor_shape, + bs, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + model_version=model_version, + swap=swap, + outputs=outputs, + use_http=use_http, + use_grpc=use_grpc, + skip_request_id_check=skip_request_id_check, + use_streaming=use_streaming, + correlation_id=correlation_id, + ) input_size = 16 - if tu.validate_for_tf_model(input_dtype, output0_dtype, output1_dtype, - (input_size,), (input_size,), - (input_size,)): + if tu.validate_for_tf_model( + input_dtype, + output0_dtype, + output1_dtype, + (input_size,), + (input_size,), + (input_size,), + ): for pf in ["graphdef", "savedmodel"]: - _infer_exact_helper(self, - pf, (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_trt_model(input_dtype, output0_dtype, output1_dtype, - (input_size, 1, 1), (input_size, 1, 1), - (input_size, 1, 1)): + _infer_exact_helper( + self, + pf, + (input_size,), + 8, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + swap=swap, + ) + + if tu.validate_for_trt_model( + input_dtype, + output0_dtype, + output1_dtype, + (input_size, 1, 1), + (input_size, 1, 1), + (input_size, 1, 1), + ): if input_dtype == np.int8: - _infer_exact_helper(self, - 'plan', (input_size, 1, 1), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) + _infer_exact_helper( + self, + "plan", + (input_size, 1, 1), + 8, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + swap=swap, + ) else: - _infer_exact_helper(self, - 'plan', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_onnx_model(input_dtype, output0_dtype, 
output1_dtype, - (input_size,), (input_size,), - (input_size,)): - _infer_exact_helper(self, - 'onnx', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_libtorch_model(input_dtype, output0_dtype, - output1_dtype, (input_size,), - (input_size,), (input_size,)): - _infer_exact_helper(self, - 'libtorch', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) + _infer_exact_helper( + self, + "plan", + (input_size,), + 8, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + swap=swap, + ) + + if tu.validate_for_onnx_model( + input_dtype, + output0_dtype, + output1_dtype, + (input_size,), + (input_size,), + (input_size,), + ): + _infer_exact_helper( + self, + "onnx", + (input_size,), + 8, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + swap=swap, + ) + + if tu.validate_for_libtorch_model( + input_dtype, + output0_dtype, + output1_dtype, + (input_size,), + (input_size,), + (input_size,), + ): + _infer_exact_helper( + self, + "libtorch", + (input_size,), + 8, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + swap=swap, + ) def test_raw_fff(self): - self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=True, - output1_raw=True, - swap=True) + self._full_exact( + np.float32, + np.float32, + np.float32, + output0_raw=True, + output1_raw=True, + swap=True, + ) def test_class_fff(self): - self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=False, - output1_raw=False, - swap=True) + self._full_exact( + np.float32, + np.float32, + np.float32, + output0_raw=False, + output1_raw=False, + swap=True, + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_string_io/string_client_test.py b/qa/L0_string_io/string_client_test.py old mode 100644 new mode 100755 index b012ce87af..16112ac70c --- a/qa/L0_string_io/string_client_test.py +++ b/qa/L0_string_io/string_client_test.py @@ -1,5 +1,5 @@ #!/usr/bin/env python -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,27 +26,26 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys -sys.path.append('../common') -import argparse -import numpy as np -import os +sys.path.append("../common") + +import unittest from builtins import range -import tritonclient.http as tritonhttpclient + +import numpy as np +import test_util as tu import tritonclient.grpc as tritongrpcclient +import tritonclient.http as tritonhttpclient import tritonclient.utils as tritonutils -import unittest -import test_util as tu class ClientStringTest(tu.TestResultCollector): - def _test_infer_unicode(self, model_name, client, input_): # Send inference request to the inference server. Get results for # both output tensors. 
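        # Each "client" argument is a tuple assembled in _test_bytes():
        #   client[0] - the InferenceServerClient instance used to run the request
        #   client[1] - the client module (tritonhttpclient or tritongrpcclient)
        #   client[2] - for HTTP, the binary_data flag for the requested output
        #   client[3] - for HTTP, the binary_data flag used when setting the input
        # gRPC tuples carry only the first three entries because gRPC requests
        # always send tensor data in binary form.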
inputs = [] outputs = [] - inputs.append(client[1].InferInput('INPUT0', input_.shape, "BYTES")) + inputs.append(client[1].InferInput("INPUT0", input_.shape, "BYTES")) if client[1] == tritonhttpclient: inputs[0].set_data_from_numpy(input_, client[3]) @@ -54,31 +53,26 @@ def _test_infer_unicode(self, model_name, client, input_): inputs[0].set_data_from_numpy(input_) if client[1] == tritonhttpclient: - outputs.append(client[1].InferRequestedOutput( - 'OUTPUT0', binary_data=client[2])) + outputs.append( + client[1].InferRequestedOutput("OUTPUT0", binary_data=client[2]) + ) else: - outputs.append(client[1].InferRequestedOutput('OUTPUT0')) + outputs.append(client[1].InferRequestedOutput("OUTPUT0")) - results = client[0].infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = client[0].infer(model_name=model_name, inputs=inputs, outputs=outputs) - out0 = results.as_numpy('OUTPUT0') + out0 = results.as_numpy("OUTPUT0") # We expect there to be 1 results (with batch-size 1). Verify # that all 8 result elements are the same as the input. self.assertTrue(np.array_equal(input_, out0)) return out0 - def _test_infer_non_unicode(self, - model_name, - client, - input_, - binary_data=True): + def _test_infer_non_unicode(self, model_name, client, input_, binary_data=True): # Send inference request to the inference server. Get results for # both output tensors. inputs = [] outputs = [] - inputs.append(client[1].InferInput('INPUT0', input_.shape, "BYTES")) + inputs.append(client[1].InferInput("INPUT0", input_.shape, "BYTES")) if client[1] == tritonhttpclient: inputs[0].set_data_from_numpy(input_, client[3]) @@ -86,57 +80,58 @@ def _test_infer_non_unicode(self, inputs[0].set_data_from_numpy(input_) if client[1] == tritonhttpclient: - outputs.append(client[1].InferRequestedOutput( - 'OUTPUT0', binary_data=client[2])) + outputs.append( + client[1].InferRequestedOutput("OUTPUT0", binary_data=client[2]) + ) else: - outputs.append(client[1].InferRequestedOutput('OUTPUT0')) + outputs.append(client[1].InferRequestedOutput("OUTPUT0")) - results = client[0].infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = client[0].infer(model_name=model_name, inputs=inputs, outputs=outputs) - out0 = results.as_numpy('OUTPUT0') + out0 = results.as_numpy("OUTPUT0") # We expect there to be 1 results (with batch-size 1). Verify # that all 8 result elements are the same as the input. if client[2]: self.assertTrue(np.array_equal(input_.astype(np.bytes_), out0)) else: self.assertTrue( - np.array_equal(input_.astype(np.bytes_), - out0.astype(np.bytes_))) + np.array_equal(input_.astype(np.bytes_), out0.astype(np.bytes_)) + ) return out0 - def _test_unicode_bytes_dtype(self, client, model_name, dtype='|S78'): + def _test_unicode_bytes_dtype(self, client, model_name, dtype="|S78"): # Create the data for the input tensor. Initialize the tensor to 8 # byte strings. 
(dtype of np.bytes_) # Sample string that should no longer cause failure - in0 = np.array([ - [ - b'\nF\n\'\n\x01a\x12"\x1a \n\x1e\xfa\x03\x94\x01\x0f\xd7\x02\xf1\x05\xdf\x01\x82\x03\xb5\x05\xc1\x07\xba\x06\xff\x06\xc7\x07L\xf5\x03\xe2\x07\xa9\x03\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x89\xcc=\n\r\n\x01c\x12\x08\x12\x06\n\x04\xdf\\\xcb\xbf' - ], - [ - b'\n:\n\x1a\n\x01a\x12\x15\x1a\x13\n\x11*\xe3\x05\xc5\x06\xda\x07\xcb\x06~\xb1\x05\xb3\x01\xa9\x02\x15\n\r\n\x01b\x12\x08\x1a\x06\n\x04\xf6\xa2\xc5\x01\n\r\n\x01c\x12\x08\x12\x06\n\x04\xbb[\n\xbf' - ], - [ - b'\nL\n-\n\x01a\x12(\x1a&\n$\x87\x07\xce\x01\xe7\x06\xee\x04\xe1\x03\xf1\x03\xd7\x07\xbe\x02\xb8\x05\xe0\x05\xe4\x01\x88\x06\xb6\x03\xb9\x05\x83\x06\xf8\x04\xe2\x04\xf4\x06\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x89\xcc=\n\r\n\x01c\x12\x08\x12\x06\n\x04\xbc\x99+@' - ], - [ - b'\n2\n\x12\n\x01a\x12\r\x1a\x0b\n\t\x99\x02\xde\x04\x9f\x04\xc5\x053\n\r\n\x01b\x12\x08\x1a\x06\n\x04\xf6\xa2\xc5\x01\n\r\n\x01c\x12\x08\x12\x06\n\x04\x12\x07\x83\xbe' - ], - [ - b'\nJ\n\r\n\x01b\x12\x08\x1a\x06\n\x04\x9b\x94\xad\x04\n\r\n\x01c\x12\x08\x12\x06\n\x04\xc3\x8a\x08\xbf\n*\n\x01a\x12%\x1a#\n!\x9c\x02\xb2\x02\xcd\x02\x9d\x07\x8d\x01\xb6\x05a\xf1\x01\xf0\x05\xdb\x02\xac\x04\xbd\x05\xe0\x04\xd2\x06\xaf\x02\xa8\x01\x8b\x04' - ], + in0 = np.array( [ - b'\n3\n\x13\n\x01a\x12\x0e\x1a\x0c\n\n<\xe2\x05\x8a\x01\xb3\x07?\xfd\x01\n\r\n\x01b\x12\x08\x1a\x06\n\x04\xf6\xa2\xc5\x01\n\r\n\x01c\x12\x08\x12\x06\n\x04\x1b\x931\xbf\x00\x00' + [ + b"\nF\n'\n\x01a\x12\"\x1a \n\x1e\xfa\x03\x94\x01\x0f\xd7\x02\xf1\x05\xdf\x01\x82\x03\xb5\x05\xc1\x07\xba\x06\xff\x06\xc7\x07L\xf5\x03\xe2\x07\xa9\x03\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x89\xcc=\n\r\n\x01c\x12\x08\x12\x06\n\x04\xdf\\\xcb\xbf" + ], + [ + b"\n:\n\x1a\n\x01a\x12\x15\x1a\x13\n\x11*\xe3\x05\xc5\x06\xda\x07\xcb\x06~\xb1\x05\xb3\x01\xa9\x02\x15\n\r\n\x01b\x12\x08\x1a\x06\n\x04\xf6\xa2\xc5\x01\n\r\n\x01c\x12\x08\x12\x06\n\x04\xbb[\n\xbf" + ], + [ + b"\nL\n-\n\x01a\x12(\x1a&\n$\x87\x07\xce\x01\xe7\x06\xee\x04\xe1\x03\xf1\x03\xd7\x07\xbe\x02\xb8\x05\xe0\x05\xe4\x01\x88\x06\xb6\x03\xb9\x05\x83\x06\xf8\x04\xe2\x04\xf4\x06\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x89\xcc=\n\r\n\x01c\x12\x08\x12\x06\n\x04\xbc\x99+@" + ], + [ + b"\n2\n\x12\n\x01a\x12\r\x1a\x0b\n\t\x99\x02\xde\x04\x9f\x04\xc5\x053\n\r\n\x01b\x12\x08\x1a\x06\n\x04\xf6\xa2\xc5\x01\n\r\n\x01c\x12\x08\x12\x06\n\x04\x12\x07\x83\xbe" + ], + [ + b"\nJ\n\r\n\x01b\x12\x08\x1a\x06\n\x04\x9b\x94\xad\x04\n\r\n\x01c\x12\x08\x12\x06\n\x04\xc3\x8a\x08\xbf\n*\n\x01a\x12%\x1a#\n!\x9c\x02\xb2\x02\xcd\x02\x9d\x07\x8d\x01\xb6\x05a\xf1\x01\xf0\x05\xdb\x02\xac\x04\xbd\x05\xe0\x04\xd2\x06\xaf\x02\xa8\x01\x8b\x04" + ], + [ + b"\n3\n\x13\n\x01a\x12\x0e\x1a\x0c\n\n<\xe2\x05\x8a\x01\xb3\x07?\xfd\x01\n\r\n\x01b\x12\x08\x1a\x06\n\x04\xf6\xa2\xc5\x01\n\r\n\x01c\x12\x08\x12\x06\n\x04\x1b\x931\xbf\x00\x00" + ], + [ + b"\n&\n\x07\n\x01a\x12\x02\x1a\x00\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x89\xcc=\n\r\n\x01c\x12\x08\x12\x06\n\x04{\xbc\x0e>\x00\x00\x00" + ], + [ + b"\nF\n'\n\x01a\x12\"\x1a \n\x1e\x97\x01\x93\x02\x9e\x01\xac\x06\xff\x01\xd8\x05\xe1\x07\xd8\x04g]\x9a\x05\xff\x06\xde\x07\x8f\x04\x97\x04\xda\x03\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x9a\xb7I\n\r\n\x01c\x12\x08\x12\x06\n\x04\xfb\x87\x83\xbf" + ], ], - [ - b'\n&\n\x07\n\x01a\x12\x02\x1a\x00\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x89\xcc=\n\r\n\x01c\x12\x08\x12\x06\n\x04{\xbc\x0e>\x00\x00\x00' - ], - [ - b'\nF\n\'\n\x01a\x12"\x1a 
\n\x1e\x97\x01\x93\x02\x9e\x01\xac\x06\xff\x01\xd8\x05\xe1\x07\xd8\x04g]\x9a\x05\xff\x06\xde\x07\x8f\x04\x97\x04\xda\x03\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x9a\xb7I\n\r\n\x01c\x12\x08\x12\x06\n\x04\xfb\x87\x83\xbf' - ] - ], - dtype=dtype).flatten() + dtype=dtype, + ).flatten() self._test_infer_unicode(model_name, client, in0) def _test_str_dtype(self, client, model_name, dtype=np.object_): @@ -147,30 +142,44 @@ def _test_str_dtype(self, client, model_name, dtype=np.object_): self._test_infer_non_unicode(model_name, client, in0_bytes) def _test_bytes(self, model_name): - dtypes = [np.object_, np.object, np.bytes_] + dtypes = [np.object_, np.bytes_] # This clients will fail for binary_data=False when the binary input # is not UTF-8 encodable. They should work for other cases however. binary_false_clients = [ - (tritonhttpclient.InferenceServerClient("localhost:8000", - verbose=True), - tritonhttpclient, True, False), - (tritonhttpclient.InferenceServerClient("localhost:8000", - verbose=True), - tritonhttpclient, False, False), - (tritonhttpclient.InferenceServerClient("localhost:8000", - verbose=True), - tritonhttpclient, False, True), + ( + tritonhttpclient.InferenceServerClient("localhost:8000", verbose=True), + tritonhttpclient, + True, + False, + ), + ( + tritonhttpclient.InferenceServerClient("localhost:8000", verbose=True), + tritonhttpclient, + False, + False, + ), + ( + tritonhttpclient.InferenceServerClient("localhost:8000", verbose=True), + tritonhttpclient, + False, + True, + ), ] # These clients work for every data type other_clients = [ - (tritongrpcclient.InferenceServerClient("localhost:8001", - verbose=True), - tritongrpcclient, False), - (tritonhttpclient.InferenceServerClient("localhost:8000", - verbose=True), - tritonhttpclient, True, True), + ( + tritongrpcclient.InferenceServerClient("localhost:8001", verbose=True), + tritongrpcclient, + False, + ), + ( + tritonhttpclient.InferenceServerClient("localhost:8000", verbose=True), + tritonhttpclient, + True, + True, + ), ] for client in other_clients + binary_false_clients: @@ -195,5 +204,5 @@ def test_tf_unicode_bytes(self): self._test_bytes("string_identity") -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_tf_gpu_io/test.sh b/qa/L0_tf_gpu_io/test.sh index 2b520d219c..98a5dff1ef 100755 --- a/qa/L0_tf_gpu_io/test.sh +++ b/qa/L0_tf_gpu_io/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2019-2023, NVIDIA CORPORATION. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -40,9 +40,8 @@ fi export CUDA_VISIBLE_DEVICES=0 -CLIENT=../clients/perf_client +TF_TEST=tf_gpu_io_test.py BACKENDS=${BACKENDS:="graphdef savedmodel"} -TENSOR_SIZE=16384 DATADIR=/data/inferenceserver/${REPO_VERSION} @@ -50,11 +49,9 @@ SERVER=/opt/tritonserver/bin/tritonserver source ../common/util.sh RET=0 - -# -# Use "identity" model for all model types. 
-# rm -f ./*.log + +# Test with qa identity TF models for BACKEND in $BACKENDS; do MODEL_NAME=${BACKEND}_zero_1_float32 rm -fr models && mkdir -p models @@ -70,7 +67,7 @@ for BACKEND in $BACKENDS; do echo "optimization { execution_accelerators { gpu_execution_accelerator : [ { name : \"gpu_io\"} ] } }" >> config.pbtxt) SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1" - SERVER_LOG="${MODEL_NAME}.serverlog" + SERVER_LOG="${MODEL_NAME}.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -80,60 +77,71 @@ for BACKEND in $BACKENDS; do set +e - $CLIENT -m${MODEL_NAME}_def --shape INPUT0:${TENSOR_SIZE} \ - >> ${BACKEND}.sanity.log 2>&1 + python $TF_TEST TfGpuIoTest.test_${MODEL_NAME}_def >> ${BACKEND}.sanity.log 2>&1 if (( $? != 0 )); then + cat ${BACKEND}.sanity.log RET=1 fi - grep "is GPU tensor: true" $SERVER_LOG + grep "is GPU tensor: true" $SERVER_LOG >> grep.out.log if [ $? -eq 0 ]; then echo -e "\n***\n*** Failed. Expected neither input or output is GPU tensor\n***" RET=1 fi - $CLIENT -m${MODEL_NAME}_gpu --shape INPUT0:${TENSOR_SIZE} \ - >> ${BACKEND}.gpu.sanity.log 2>&1 + python $TF_TEST TfGpuIoTest.test_${MODEL_NAME}_gpu >> ${BACKEND}.gpu.sanity.log 2>&1 if (( $? != 0 )); then + cat ${BACKEND}.gpu.sanity.log RET=1 fi - grep "is GPU tensor: true" $SERVER_LOG + grep "is GPU tensor: true" $SERVER_LOG >> grep.out.log if [ $? -ne 0 ]; then echo -e "\n***\n*** Failed. Expected input and output are GPU tensors\n***" RET=1 fi - # Sample latency results - $CLIENT -m${MODEL_NAME}_def --shape INPUT0:${TENSOR_SIZE} \ - >> ${BACKEND}.log 2>&1 - if (( $? != 0 )); then - RET=1 - fi - - $CLIENT -m${MODEL_NAME}_gpu --shape INPUT0:${TENSOR_SIZE} \ - >> ${BACKEND}.gpu.log 2>&1 - if (( $? != 0 )); then - RET=1 - fi - set -e kill $SERVER_PID wait $SERVER_PID done -for BACKEND in $BACKENDS; do - echo -e "\n${BACKEND}\n************" - cat ${BACKEND}.log - echo -e "\n${BACKEND} with GPU I/O\n************" - cat ${BACKEND}.gpu.log -done +# Test savedmodel with mismatched key and name +rm -rf models && mkdir -p models +cp -r $DATADIR/qa_tf_tag_sigdef_repository/sig_tag0 models +(cd models/sig_tag0 && \ + echo "optimization { execution_accelerators { gpu_execution_accelerator : [ { name : \"gpu_io\"} ] } }" >> config.pbtxt) + +SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1" +SERVER_LOG="sig_tag0.server.log" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +CLIENT_LOG="sig_tag0.gpu.log" +python $TF_TEST TfGpuIoTest.test_sig_tag0 >> $CLIENT_LOG 2>&1 +if (( $? != 0 )); then + cat $CLIENT_LOG + RET=1 +fi +grep "is GPU tensor: true" $SERVER_LOG >> grep.out.log +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed. Expected input and output are GPU tensors\n***" + RET=1 +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else echo -e "\n***\n*** Test FAILED\n***" fi - exit $RET diff --git a/qa/L0_tf_gpu_io/tf_gpu_io_test.py b/qa/L0_tf_gpu_io/tf_gpu_io_test.py new file mode 100755 index 0000000000..fd3550e434 --- /dev/null +++ b/qa/L0_tf_gpu_io/tf_gpu_io_test.py @@ -0,0 +1,105 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import unittest + +import infer_util as iu +import numpy as np +import test_util as tu + +TENSOR_SIZE = 16384 + + +class TfGpuIoTest(tu.TestResultCollector): + def _test_helper( + self, + model_name, + shape, + override_input_names=[], + override_output_names=[], + batching_enabled=False, + ): + try: + bs = 1 + if batching_enabled: + shape = [ + [ + bs, + ] + + shape + ] + iu.infer_zero( + self, + "graphdef", + bs, + np.float32, + shape, + shape, + override_model_name=model_name, + override_input_names=override_input_names, + override_output_names=override_output_names, + ) + + except Exception as ex: + self.assertTrue(False, "unexpected error {}".format(ex)) + + def test_sig_tag0(self): + self._test_helper( + "sig_tag0", + [16], + override_input_names=["INPUT"], + override_output_names=["OUTPUT"], + ) + + def test_graphdef_zero_1_float32_def(self): + self._test_helper( + "graphdef_zero_1_float32_def", [TENSOR_SIZE], batching_enabled=True + ) + + def test_graphdef_zero_1_float32_gpu(self): + self._test_helper( + "graphdef_zero_1_float32_gpu", [TENSOR_SIZE], batching_enabled=True + ) + + def test_savedmodel_zero_1_float32_def(self): + self._test_helper( + "savedmodel_zero_1_float32_def", [TENSOR_SIZE], batching_enabled=True + ) + + def test_savedmodel_zero_1_float32_gpu(self): + self._test_helper( + "savedmodel_zero_1_float32_gpu", [TENSOR_SIZE], batching_enabled=True + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_tf_parameters/test.sh b/qa/L0_tf_parameters/test.sh new file mode 100755 index 0000000000..133b6ef68d --- /dev/null +++ b/qa/L0_tf_parameters/test.sh @@ -0,0 +1,150 @@ +#!/bin/bash +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi +source ../common/util.sh + +export CUDA_VISIBLE_DEVICES=0 + +DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_tf_parameters_repository +TEST_RESULT_FILE='test_results.txt' +CLIENT_LOG="./client.log" +TEST=tf_parameter_test.py +EXPECTED_NUM_TESTS="1" +MODEL_REPOSITORY=`pwd`/models +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_LOG="./inference_server.log" + +RET=0 + +rm -rf $SERVER_LOG $CLIENT_LOG models/ +cp -r $DATADIR models +SERVER_ARGS="--model-repository=$MODEL_REPOSITORY" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python $TEST TFParameterTest.test_tf_variable_error>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Add the initialization operation +echo "{\"init_ops\": [\"init\"]}" > models/graphdef_variable/init_ops.json +echo "parameters: { key: \"TF_INIT_OPS_FILE\" value: { string_value:\"init_ops.json\" }}" >> models/graphdef_variable/config.pbtxt + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python $TEST TFParameterTest.test_tf_variable>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Move the initialization op to the model version folder. 
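# (For reference: init_ops.json names the TF operations used to initialize the
# model's variables; until it is wired up via TF_INIT_OPS_FILE the run above
# expects the uninitialized-variable FAILED_PRECONDITION error, and after it is
# added test_tf_variable expects OUTPUT == INPUT. The block below repeats the
# passing case with init_ops.json placed in the model version directory.)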
+mv models/graphdef_variable/init_ops.json models/graphdef_variable/1/ + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python $TEST TFParameterTest.test_tf_variable>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + cat $CLIENT_LOG + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_tf_parameters/tf_parameter_test.py b/qa/L0_tf_parameters/tf_parameter_test.py new file mode 100755 index 0000000000..f1a4621d93 --- /dev/null +++ b/qa/L0_tf_parameters/tf_parameter_test.py @@ -0,0 +1,81 @@ +#!/usr/bin/env python3 + +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import unittest + +import numpy as np +import test_util as tu +import tritonclient.http as tritonhttpclient +import tritonclient.utils + + +class TFParameterTest(tu.TestResultCollector): + def setUp(self): + self._client = tritonhttpclient.InferenceServerClient( + "localhost:8000", verbose=True + ) + + def _infer_helper(self): + # The model has a single variable which is added to the input. Since the + # variable is initialized to zero the input and output must match. 
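# Concretely, for the request below: INPUT = [10], the variable is 0 after the
# init ops run, so OUTPUT = INPUT + 0 = [10], which is exactly what the
# np.testing.assert_array_equal check verifies.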
+ model_name = "graphdef_variable" + input = np.array([10], dtype=np.int32) + + inputs = [] + inputs.append(tritonhttpclient.InferInput("INPUT", input.shape, "INT32")) + inputs[-1].set_data_from_numpy(input) + + outputs = [] + outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT")) + + results = self._client.infer( + model_name=model_name, inputs=inputs, outputs=outputs + ) + output = results.as_numpy("OUTPUT") + np.testing.assert_array_equal(output, input) + + def test_tf_variable(self): + self._infer_helper() + + def test_tf_variable_error(self): + with self.assertRaises(tritonclient.utils.InferenceServerException) as e: + self._infer_helper() + self.assertIn( + "FAILED_PRECONDITION: Could not find variable VARIABLE. This " + + "could mean that the variable has been deleted. In TF1, it can " + + "also mean the variable is uninitialized.", + e.exception.message(), + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_tf_tag_sigdef/test.sh b/qa/L0_tf_tag_sigdef/test.sh index 8a0295d810..32248c74ad 100755 --- a/qa/L0_tf_tag_sigdef/test.sh +++ b/qa/L0_tf_tag_sigdef/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -43,22 +43,29 @@ export CUDA_VISIBLE_DEVICES=0 TEST_RESULT_FILE='test_results.txt' CLIENT_LOG="./client.log" TEST=tf_tag_sigdef_test.py -MAKE_MODEL=gen_tag_sigdef.py DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_tf_tag_sigdef_repository +MODELDIR=`pwd`/models + +rm -rf $SERVER_LOG $CLIENT_LOG $MODELDIR +mkdir $MODELDIR +cp -r $DATADIR/* $MODELDIR + EXPECTED_NUM_TESTS="4" SERVER=/opt/tritonserver/bin/tritonserver -SERVER_ARGS="--model-repository=$DATADIR --exit-timeout-secs=120" +SERVER_ARGS="--model-repository=$MODELDIR --exit-timeout-secs=120" SERVER_LOG="./inference_server.log" source ../common/util.sh -rm -f $SERVER_LOG $CLIENT_LOG - RET=0 run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" + if [ `grep -c "configuration expects 2 inputs, model provides 1" $SERVER_LOG` != "0" ]; then + echo -e "*** FAILED: sig_tag_different_io config autocompleted with wrong model tag variant, failed to load.\n" + RET=1 + fi cat $SERVER_LOG exit 1 fi diff --git a/qa/L0_tf_tag_sigdef/tf_tag_sigdef_test.py b/qa/L0_tf_tag_sigdef/tf_tag_sigdef_test.py old mode 100644 new mode 100755 index d3739afea4..b4a11ac04e --- a/qa/L0_tf_tag_sigdef/tf_tag_sigdef_test.py +++ b/qa/L0_tf_tag_sigdef/tf_tag_sigdef_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,14 +30,11 @@ sys.path.append("../common") -from builtins import range -from future.utils import iteritems import unittest + import numpy as np -import os import test_util as tu import tritonhttpclient as httpclient -from tritonclientutils import InferenceServerException class TagSigdefTest(tu.TestResultCollector): @@ -53,16 +52,14 @@ def _test_helper(self, modelVersion, tag, sig_def): # for details multiplier = modelVersion + 1 output_name = "OUTPUT" - triton_client = httpclient.InferenceServerClient("localhost:8000", - verbose=True) + triton_client = httpclient.InferenceServerClient("localhost:8000", verbose=True) inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT', shape, "FP32")) + inputs.append(httpclient.InferInput("INPUT", shape, "FP32")) input_data = np.ones(shape=shape).astype(np.float32) inputs[0].set_data_from_numpy(input_data, binary_data=True) - outputs.append( - httpclient.InferRequestedOutput(output_name, binary_data=True)) + outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True)) results = triton_client.infer(model_name, inputs, outputs=outputs) output_data = results.as_numpy(output_name) test_output = input_data * multiplier @@ -81,5 +78,5 @@ def test_tag_sig_def(self): self._test_helper(3, self.test_tag, self.test_sig_def) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_tf_unknown_rank/test.sh b/qa/L0_tf_unknown_rank/test.sh old mode 100644 new mode 100755 index ab9db57f24..e279a46267 --- a/qa/L0_tf_unknown_rank/test.sh +++ b/qa/L0_tf_unknown_rank/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -79,7 +79,7 @@ else fi fi -python $UNKNOWN_RANK_TEST UnknownRankTest.test_wrong_output >> $CLIENT_LOG 2>&1 +python $UNKNOWN_RANK_TEST UnknownRankTest.test_wrong_input >> $CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then echo -e "\n***\n*** Test Failed\n***" cat $CLIENT_LOG @@ -109,9 +109,10 @@ if [ "$SERVER_PID" != "0" ]; then kill $SERVER_PID wait $SERVER_PID else - ERROR_MESSAGE="unable to autofill for 'scalar_model': the rank of model tensor 'x' is 0 which is not supported" + ERROR_MESSAGE="Unable to autofill for 'scalar_model': the rank of model tensor 'x' is 0 and dimensions are not defined" if [[ $(cat $SERVER_LOG | grep "${ERROR_MESSAGE}" | wc -l) -ne 2 ]]; then echo -e "\n***\n*** Test Failed: "${ERROR_MESSAGE}" not found\n***" + cat $SERVER_LOG RET=1 fi fi diff --git a/qa/L0_tf_unknown_rank/tf_unknown_rank_test.py b/qa/L0_tf_unknown_rank/tf_unknown_rank_test.py old mode 100644 new mode 100755 index 427220a782..add6b32c13 --- a/qa/L0_tf_unknown_rank/tf_unknown_rank_test.py +++ b/qa/L0_tf_unknown_rank/tf_unknown_rank_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,9 +27,11 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") import unittest + import numpy as np import test_util as tu import tritonhttpclient @@ -39,33 +43,40 @@ class UnknownRankTest(tu.TestResultCollector): def infer_unknown(self, model_name, tensor_shape): print("About to run the test") input_data = np.random.random_sample(tensor_shape).astype(np.float32) - client = tritonhttpclient.InferenceServerClient('localhost:8000') + client = tritonhttpclient.InferenceServerClient("localhost:8000") inputs = [ - tritonhttpclient.InferInput("INPUT", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + tritonhttpclient.InferInput( + "INPUT", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) results = client.infer(model_name, inputs) - self.assertTrue(np.array_equal(results.as_numpy('OUTPUT'), input_data)) + self.assertTrue(np.array_equal(results.as_numpy("OUTPUT"), input_data)) def test_success(self): model_name = "unknown_rank_success" - tensor_shape = (1,) + tensor_shape = 1 try: self.infer_unknown(model_name, tensor_shape) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) - def test_wrong_output(self): - tensor_shape = (1,) + def test_wrong_input(self): model_name = "unknown_rank_wrong_output" + tensor_shape = (1, 2) try: self.infer_unknown(model_name, tensor_shape) + self.fail( + "Found success when expected failure with model given " + "wrong input tensor [1,2] for input [-1,1]." + ) except InferenceServerException as ex: - self.assertIn("tensor \'OUTPUT\': the model expects 1 dimensions " \ - "(shape [1]) but the model configuration specifies 2 dimensions " \ - "(shape [1,1])", ex.message()) + self.assertIn( + "unexpected shape for input 'INPUT' for model " + "'unknown_rank_wrong_output'. Expected [1], got [1,2]", + ex.message(), + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_tftrt_optimization/tftrt_optimization_test.py b/qa/L0_tftrt_optimization/tftrt_optimization_test.py old mode 100644 new mode 100755 index b25734f606..9e59677317 --- a/qa/L0_tftrt_optimization/tftrt_optimization_test.py +++ b/qa/L0_tftrt_optimization/tftrt_optimization_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,51 +27,49 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") import unittest + import numpy as np import test_util as tu import tritonhttpclient as httpclient -from tritonclientutils import InferenceServerException class TFTRTOptimizationTest(tu.TestResultCollector): - def setUp(self): - self.input0_ = np.arange(start=0, stop=16, - dtype=np.float32).reshape(1, 16) + self.input0_ = np.arange(start=0, stop=16, dtype=np.float32).reshape(1, 16) self.input1_ = np.ones(shape=16, dtype=np.float32).reshape(1, 16) self.expected_output0_ = self.input0_ + self.input1_ self.expected_output1_ = self.input0_ - self.input1_ def _addsub_infer(self, model_name): - triton_client = httpclient.InferenceServerClient("localhost:8000", - verbose=True) + triton_client = httpclient.InferenceServerClient("localhost:8000", verbose=True) inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "FP32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "FP32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "FP32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "FP32")) # Initialize the data inputs[0].set_data_from_numpy(self.input0_, binary_data=True) inputs[1].set_data_from_numpy(self.input1_, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=True)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=True)) results = triton_client.infer(model_name, inputs, outputs=outputs) - output0_data = results.as_numpy('OUTPUT0') - output1_data = results.as_numpy('OUTPUT1') + output0_data = results.as_numpy("OUTPUT0") + output1_data = results.as_numpy("OUTPUT1") - self.assertTrue(np.array_equal(self.expected_output0_, output0_data), - "incorrect sum") - self.assertTrue(np.array_equal(self.expected_output1_, output1_data), - "incorrect difference") + self.assertTrue( + np.array_equal(self.expected_output0_, output0_data), "incorrect sum" + ) + self.assertTrue( + np.array_equal(self.expected_output1_, output1_data), "incorrect difference" + ) def test_graphdef(self): self._addsub_infer("graphdef_float32_float32_float32_trt") @@ -80,5 +80,5 @@ def test_savedmodel(self): self._addsub_infer("savedmodel_float32_float32_float32_param") -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_trace/opentelemetry_unittest.py b/qa/L0_trace/opentelemetry_unittest.py new file mode 100644 index 0000000000..5055f4e88a --- /dev/null +++ b/qa/L0_trace/opentelemetry_unittest.py @@ -0,0 +1,274 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") +import json +import re +import unittest + +import numpy as np +import test_util as tu +import tritonclient.grpc as grpcclient +import tritonclient.http as httpclient + +EXPECTED_NUM_SPANS = 16 +# OpenTelemetry OStream exporter sets `parent_span_id` to "0000000000000000", +# if current span is a root span, i.e. there is no parent span. +# https://github.com/open-telemetry/opentelemetry-cpp/blob/b7fd057185c4ed2dff507b859cbe058b7609fb4a/exporters/ostream/src/span_exporter.cc#L78C54-L78C68 +NO_PARENT_SPAN = "0000000000000000" + + +class OpenTelemetryTest(tu.TestResultCollector): + def setUp(self): + # Extracted spans are in json-like format, thus data needs to be + # post-processed, so that `json` could accept it for further + # processing + with open("trace_collector.log", "rt") as f: + data = f.read() + # Removing new lines and tabs around `{` + json_string = re.sub("\n\t{\n\t", "{", data) + # `resources` field is a dictionary, so adding `{` and`}` + # in the next 2 transformations, `instr-lib` is a next field, + # so whatever goes before it, belongs to `resources`. + json_string = re.sub( + "resources : \n\t", "resources : {\n\t", json_string + ) + json_string = re.sub( + "\n instr-lib :", "}\n instr-lib :", json_string + ) + # `json`` expects "key":"value" format, some fields in the + # data have empty string as value, so need to add `"",` + json_string = re.sub(": \n\t", ':"",', json_string) + json_string = re.sub(": \n", ':"",', json_string) + # Extracted data missing `,' after each key-value pair, + # which `json` exppects + json_string = re.sub("\n|\n\t", ",", json_string) + # Removing tabs + json_string = re.sub("\t", "", json_string) + # `json` expects each key and value have `"`'s, so adding them to + # every word/number/alpha-numeric entry + json_string = re.sub(r"\b([\w.-]+)\b", r'"\1"', json_string) + # `span kind`` represents one key + json_string = re.sub('"span" "kind"', '"span kind"', json_string) + # Removing extra `,` + json_string = re.sub("{,", "{", json_string) + json_string = re.sub(",}", "}", json_string) + # Adding `,` between dictionary entries + json_string = re.sub("}{", "},{", json_string) + # `events` is a list of dictionaries, `json` will accept it in the + # form of "events" : [{....}, {.....}, ...] + json_string = re.sub( + '"events" : {', '"events" : [{', json_string + ) + # Closing `events`' list of dictionaries + json_string = re.sub('}, "links"', '}], "links"', json_string) + # Last 2 symbols are not needed + json_string = json_string[:-2] + # Since now `json_string` is a string, which represents dictionaries, + # we put it into one dictionary, so that `json` could read it as one. 
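# For orientation, after the transformations above `json_string` is expected to
# be a comma-separated sequence of span dictionaries, each roughly of the form
# sketched here (the field names match how the spans are read later in this
# file; the concrete values are illustrative only, not taken from a real trace):
#
#   {
#     "name": "compute",
#     "trace_id": "...",
#     "span_id": "...",
#     "parent_span_id": "0000000000000000",
#     "span kind": "...",
#     "events": "... COMPUTE_START ... COMPUTE_END ...",
#     "resources": {"test.key": "test.value", "service.name": "test_triton"},
#     "instr-lib": "triton-server"
#   }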
+ json_string = '{ "spans" :[' + json_string + "] }" + self.spans = json.loads(json_string)["spans"] + + self.simple_model_name = "simple" + self.ensemble_model_name = "ensemble_add_sub_int32_int32_int32" + self.bls_model_name = "bls_simple" + self.root_span = "InferRequest" + + def _check_events(self, span_name, events): + root_events_http = [ + "HTTP_RECV_START", + "HTTP_RECV_END", + "INFER_RESPONSE_COMPLETE", + "HTTP_SEND_START", + "HTTP_SEND_END", + ] + root_events_grpc = [ + "GRPC_WAITREAD_START", + "GRPC_WAITREAD_END", + "INFER_RESPONSE_COMPLETE", + "GRPC_SEND_START", + "GRPC_SEND_END", + ] + request_events = ["REQUEST_START", "QUEUE_START", "REQUEST_END"] + compute_events = [ + "COMPUTE_START", + "COMPUTE_INPUT_END", + "COMPUTE_OUTPUT_START", + "COMPUTE_END", + ] + + if span_name == "compute": + # Check that all compute related events (and only them) + # are recorded in compute span + self.assertTrue(all(entry in events for entry in compute_events)) + self.assertFalse(all(entry in events for entry in request_events)) + self.assertFalse( + all(entry in events for entry in root_events_http + root_events_grpc) + ) + + elif span_name == self.root_span: + # Check that root span has INFER_RESPONSE_COMPLETE, _RECV/_WAITREAD + # and _SEND events (and only them) + if "HTTP" in events: + self.assertTrue(all(entry in events for entry in root_events_http)) + self.assertFalse(all(entry in events for entry in root_events_grpc)) + + elif "GRPC" in events: + self.assertTrue(all(entry in events for entry in root_events_grpc)) + self.assertFalse(all(entry in events for entry in root_events_http)) + self.assertFalse(all(entry in events for entry in request_events)) + self.assertFalse(all(entry in events for entry in compute_events)) + + elif span_name == self.simple_model_name: + # Check that all request related events (and only them) + # are recorded in request span + self.assertTrue(all(entry in events for entry in request_events)) + self.assertFalse( + all(entry in events for entry in root_events_http + root_events_grpc) + ) + self.assertFalse(all(entry in events for entry in compute_events)) + + def _check_parent(self, child_span, parent_span): + # Check that child and parent span have the same trace_id + # and child's `parent_span_id` is the same as parent's `span_id` + self.assertEqual(child_span["trace_id"], parent_span["trace_id"]) + self.assertNotEqual( + child_span["parent_span_id"], + NO_PARENT_SPAN, + "child span does not have parent span id specified", + ) + self.assertEqual( + child_span["parent_span_id"], + parent_span["span_id"], + "child {} , parent {}".format(child_span, parent_span), + ) + + def test_spans(self): + parsed_spans = [] + + # Check that collected spans have proper events recorded + for span in self.spans: + span_name = span["name"] + self._check_events(span_name, str(span["events"])) + parsed_spans.append(span_name) + + # There should be 16 spans in total: + # 3 for http request, 3 for grpc request, 4 for ensemble, 6 for bls + self.assertEqual(len(self.spans), EXPECTED_NUM_SPANS) + # We should have 5 compute spans + self.assertEqual(parsed_spans.count("compute"), 5) + # 7 request spans + # (4 named simple - same as our model name, 2 ensemble, 1 bls) + self.assertEqual(parsed_spans.count(self.simple_model_name), 4) + self.assertEqual(parsed_spans.count(self.ensemble_model_name), 2) + self.assertEqual(parsed_spans.count(self.bls_model_name), 1) + # 4 root spans + self.assertEqual(parsed_spans.count(self.root_span), 4) + + def test_nested_spans(self): + # First 3 spans in 
`self.spans` belong to HTTP request + # They are recorded in the following order: + # compute_span [idx 0] , request_span [idx 1], root_span [idx 2]. + # compute_span should be a child of request_span + # request_span should be a child of root_span + for child, parent in zip(self.spans[:3], self.spans[1:3]): + self._check_parent(child, parent) + + # Next 3 spans in `self.spans` belong to GRPC request + # Order of spans and their relationship described earlier + for child, parent in zip(self.spans[3:6], self.spans[4:6]): + self._check_parent(child, parent) + + # Next 4 spans in `self.spans` belong to ensemble request + # Order of spans: compute span - request span - request span - root span + for child, parent in zip(self.spans[6:10], self.spans[7:10]): + self._check_parent(child, parent) + + # Final 6 spans in `self.spans` belong to bls with ensemble request + # Order of spans: + # compute span - request span (simple) - request span (ensemble)- + # - compute (for bls) - request (bls) - root span + # request span (ensemble) and compute (for bls) are children of + # request (bls) + children = self.spans[10:] + parents = (self.spans[11:13], self.spans[14], self.spans[14:]) + for child, parent in zip(children, parents[0]): + self._check_parent(child, parent) + + def test_resource_attributes(self): + for span in self.spans: + self.assertIn("test.key", span["resources"]) + self.assertEqual("test.value", span["resources"]["test.key"]) + self.assertIn("service.name", span["resources"]) + self.assertEqual("test_triton", span["resources"]["service.name"]) + + +def prepare_data(client): + inputs = [] + input0_data = np.full(shape=(1, 16), fill_value=-1, dtype=np.int32) + input1_data = np.full(shape=(1, 16), fill_value=-1, dtype=np.int32) + + inputs.append(client.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(client.InferInput("INPUT1", [1, 16], "INT32")) + + # Initialize the data + inputs[0].set_data_from_numpy(input0_data) + inputs[1].set_data_from_numpy(input1_data) + + return inputs + + +def prepare_traces(): + triton_client_http = httpclient.InferenceServerClient( + "localhost:8000", verbose=True + ) + triton_client_grpc = grpcclient.InferenceServerClient( + "localhost:8001", verbose=True + ) + inputs = prepare_data(httpclient) + triton_client_http.infer("simple", inputs) + + inputs = prepare_data(grpcclient) + triton_client_grpc.infer("simple", inputs) + + inputs = prepare_data(httpclient) + triton_client_http.infer("ensemble_add_sub_int32_int32_int32", inputs) + + send_bls_request(model_name="ensemble_add_sub_int32_int32_int32") + + +def send_bls_request(model_name="simple"): + with httpclient.InferenceServerClient("localhost:8000") as client: + inputs = prepare_data(httpclient) + inputs.append(httpclient.InferInput("MODEL_NAME", [1], "BYTES")) + inputs[-1].set_data_from_numpy(np.array([model_name], dtype=np.object_)) + client.infer("bls_simple", inputs) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_trace/test.sh b/qa/L0_trace/test.sh index c7130a5645..56f3250b81 100755 --- a/qa/L0_trace/test.sh +++ b/qa/L0_trace/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
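The shell test that follows drives the trace-settings endpoints through curl helper functions (get/update, global and per-model). The same requests can be issued from Python; a sketch using the requests package, assuming only that the server is listening on localhost:8000 as elsewhere in this test:

import requests

# Global trace settings (same endpoint as the get_global_trace_setting helper below).
r = requests.get("http://localhost:8000/v2/trace/setting")
print(r.status_code, r.json())

# Per-model update (same endpoint as the update_trace_setting helper below);
# invalid or out-of-range values are expected to come back as HTTP 400.
r = requests.post(
    "http://localhost:8000/v2/models/simple/trace/setting",
    json={"trace_level": ["TIMESTAMPS"]},
)
print(r.status_code, r.json())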
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -51,6 +51,7 @@ export CUDA_VISIBLE_DEVICES=0 DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository ENSEMBLEDIR=$DATADIR/../qa_ensemble_model_repository/qa_model_repository/ +BLSDIR=../python_models/bls_simple MODELBASE=onnx_int32_int32_int32 MODELSDIR=`pwd`/trace_models @@ -62,19 +63,94 @@ rm -f *.log rm -fr $MODELSDIR && mkdir -p $MODELSDIR # set up simple and global_simple model using MODELBASE -rm -fr $MODELSDIR && mkdir -p $MODELSDIR && \ - cp -r $DATADIR/$MODELBASE $MODELSDIR/simple && \ +cp -r $DATADIR/$MODELBASE $MODELSDIR/simple && \ rm -r $MODELSDIR/simple/2 && rm -r $MODELSDIR/simple/3 && \ (cd $MODELSDIR/simple && \ sed -i "s/^name:.*/name: \"simple\"/" config.pbtxt) && \ cp -r $MODELSDIR/simple $MODELSDIR/global_simple && \ (cd $MODELSDIR/global_simple && \ sed -i "s/^name:.*/name: \"global_simple\"/" config.pbtxt) && \ + cp -r $ENSEMBLEDIR/simple_onnx_int32_int32_int32 $MODELSDIR/ensemble_add_sub_int32_int32_int32 && \ + rm -r $MODELSDIR/ensemble_add_sub_int32_int32_int32/2 && \ + rm -r $MODELSDIR/ensemble_add_sub_int32_int32_int32/3 && \ + (cd $MODELSDIR/ensemble_add_sub_int32_int32_int32 && \ + sed -i "s/^name:.*/name: \"ensemble_add_sub_int32_int32_int32\"/" config.pbtxt && \ + sed -i "s/model_name:.*/model_name: \"simple\"/" config.pbtxt) && \ + mkdir -p $MODELSDIR/bls_simple/1 && cp $BLSDIR/bls_simple.py $MODELSDIR/bls_simple/1/model.py RET=0 +# Helpers ======================================= +function assert_curl_success { + message="${1}" + if [ "$code" != "200" ]; then + cat ./curl.out + echo -e "\n***\n*** ${message} : line ${BASH_LINENO}\n***" + RET=1 + fi +} + +function assert_curl_failure { + message="${1}" + if [ "$code" != "400" ]; then + cat ./curl.out + echo -e "\n***\n*** ${message} : line ${BASH_LINENO}\n***" + RET=1 + fi +} + +function get_global_trace_setting { + rm -f ./curl.out + set +e + code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/trace/setting` + set -e +} + +function get_trace_setting { + model_name="${1}" + rm -f ./curl.out + set +e + code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/models/${model_name}/trace/setting` + set -e +} + +function update_global_trace_setting { + settings="${1}" + rm -f ./curl.out + set +e + code=`curl -s -w %{http_code} -o ./curl.out -X POST localhost:8000/v2/trace/setting -d ${settings}` + set -e +} + +function update_trace_setting { + model_name="${1}" + settings="${2}" + rm -f ./curl.out + set +e + code=`curl -s -w %{http_code} -o ./curl.out -X POST localhost:8000/v2/models/${model_name}/trace/setting -d ${settings}` + set -e +} + +function send_inference_requests { + log_file="${1}" + upper_bound="${2}" + for (( p = 1; p <= $upper_bound; p++ )) do + $SIMPLE_HTTP_CLIENT >> ${log_file} 2>&1 + if [ $? -ne 0 ]; then + RET=1 + fi + + $SIMPLE_GRPC_CLIENT >> ${log_file} 2>&1 + if [ $? -ne 0 ]; then + RET=1 + fi + done +} + +#======================================= + # start with trace-level=OFF -SERVER_ARGS="--trace-file=trace_off_to_min.log --trace-level=OFF --trace-rate=1 --model-repository=$MODELSDIR" +SERVER_ARGS="--trace-config triton,file=trace_off_to_min.log --trace-config level=OFF --trace-config rate=1 --model-repository=$MODELSDIR" SERVER_LOG="./inference_server_off.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -85,28 +161,10 @@ fi set +e -for p in {1..10}; do - $SIMPLE_HTTP_CLIENT >> client_off.log 2>&1 - if [ $? 
-ne 0 ]; then - RET=1 - fi - - $SIMPLE_GRPC_CLIENT >> client_off.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi -done - # Enable via trace API and send again -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_level":["TIMESTAMPS"]}' localhost:8000/v2/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +update_global_trace_setting '{"trace_level":["TIMESTAMPS"]}' +assert_curl_success "Failed to modify global trace settings" + # Check if the current setting is returned if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then RET=1 @@ -121,17 +179,7 @@ if [ `grep -c "\"trace_file\":\"trace_off_to_min.log\"" ./curl.out` != "1" ]; th RET=1 fi -for p in {1..10}; do - $SIMPLE_HTTP_CLIENT >> client_min.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi - - $SIMPLE_GRPC_CLIENT >> client_min.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi -done +send_inference_requests "client_min.log" 10 set -e @@ -140,7 +188,7 @@ wait $SERVER_PID set +e -# Expect only the requests after calling trace API are traced +# Expect only the requests after calling trace API are traced $TRACE_SUMMARY -t trace_off_to_min.log > summary_off_to_min.log if [ `grep -c "COMPUTE_INPUT_END" summary_off_to_min.log` != "20" ]; then @@ -158,7 +206,7 @@ fi set -e # Add model specific setting -SERVER_ARGS="--trace-file=global_trace.log --trace-level=TIMESTAMPS --trace-rate=6 --model-repository=$MODELSDIR" +SERVER_ARGS="--trace-config triton,file=global_trace.log --trace-config level=TIMESTAMPS --trace-config rate=6 --model-repository=$MODELSDIR" SERVER_LOG="./inference_server_off.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -170,16 +218,10 @@ fi set +e # Add trace setting for 'simple' via trace API, first use the same trace file -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_file":"global_trace.log"}' localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi -# Check if the current setting is returned (not specified setting from global) +update_trace_setting "simple" '{"trace_file":"global_trace.log"}' +assert_curl_success "Failed to modify trace settings for 'simple' model" + +# Check if the current setting is returned (not specified setting from global) if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then RET=1 fi @@ -194,17 +236,10 @@ if [ `grep -c "\"trace_file\":\"global_trace.log\"" ./curl.out` != "1" ]; then fi # Use a different name -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_file":"simple_trace.log","log_frequency":"2"}' localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +update_trace_setting "simple" '{"trace_file":"simple_trace.log","log_frequency":"2"}' +assert_curl_success "Failed to modify trace settings for 'simple' model" -# Check if the current setting is returned (not specified setting from global) +# Check if the current setting is returned (not specified setting from global) if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then RET=1 fi @@ -221,17 +256,7 @@ if [ `grep -c "\"trace_file\":\"simple_trace.log\"" ./curl.out` != "1" ]; then RET=1 fi -for p in {1..10}; do - $SIMPLE_HTTP_CLIENT >> client_simple.log 2>&1 - if [ $? 
-ne 0 ]; then - RET=1 - fi - - $SIMPLE_GRPC_CLIENT >> client_simple.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi -done +send_inference_requests "client_simple.log" 10 set -e @@ -276,7 +301,7 @@ fi set -e # Update and clear model specific setting -SERVER_ARGS="--trace-file=global_trace.log --trace-level=TIMESTAMPS --trace-rate=6 --model-repository=$MODELSDIR" +SERVER_ARGS="--trace-config triton,file=global_trace.log --trace-config level=TIMESTAMPS --trace-config rate=6 --model-repository=$MODELSDIR" SERVER_LOG="./inference_server_off.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -288,25 +313,11 @@ fi set +e # Add model setting and update it -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_file":"update_trace.log", "trace_rate":"1"}' localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +update_trace_setting "simple" '{"trace_file":"update_trace.log","trace_rate":"1"}' +assert_curl_success "Failed to modify trace settings for 'simple' model" -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_file":"update_trace.log", "trace_level":["OFF"]}' localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +update_trace_setting "simple" '{"trace_file":"update_trace.log","trace_level":["OFF"]}' +assert_curl_success "Failed to modify trace settings for 'simple' model" # Check if the current setting is returned if [ `grep -c "\"trace_level\":\[\"OFF\"\]" ./curl.out` != "1" ]; then @@ -326,31 +337,14 @@ if [ `grep -c "\"trace_file\":\"update_trace.log\"" ./curl.out` != "1" ]; then fi # Send requests to simple where trace is explicitly disabled -for p in {1..10}; do - $SIMPLE_HTTP_CLIENT >> client_update.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi - - $SIMPLE_GRPC_CLIENT >> client_update.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi -done +send_inference_requests "client_update.log" 10 rm -f ./curl.out set +e -# Clear trace setting by explicitly asking removal for every feild except 'trace_rate' -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_file":null, "trace_level":null}' localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +# Clear trace setting by explicitly asking removal for every field except 'trace_rate' +update_trace_setting "simple" '{"trace_file":null,"trace_level":null}' +assert_curl_success "Failed to modify trace settings for 'simple' model" # Check if the current setting (global) is returned if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then @@ -370,17 +364,7 @@ if [ `grep -c "\"trace_file\":\"global_trace.log\"" ./curl.out` != "1" ]; then fi # Send requests to simple where now uses global setting -for p in {1..5}; do - $SIMPLE_HTTP_CLIENT >> client_clear.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi - - $SIMPLE_GRPC_CLIENT >> client_clear.log 2>&1 - if [ $? 
-ne 0 ]; then - RET=1 - fi -done +send_inference_requests "client_clear.log" 5 set -e @@ -411,7 +395,7 @@ fi set -e # Update trace count -SERVER_ARGS="--trace-file=global_count.log --trace-level=TIMESTAMPS --trace-rate=1 --model-repository=$MODELSDIR" +SERVER_ARGS="--trace-config triton,file=global_count.log --trace-config level=TIMESTAMPS --trace-config rate=1 --model-repository=$MODELSDIR" SERVER_LOG="./inference_server_off.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -423,30 +407,14 @@ fi set +e # Send requests without trace count -for p in {1..10}; do - $SIMPLE_HTTP_CLIENT >> client_update.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi - - $SIMPLE_GRPC_CLIENT >> client_update.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi -done +send_inference_requests "client_update.log" 10 set -e # Check the current setting -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +get_trace_setting "simple" +assert_curl_success "Failed to obtain trace settings for 'simple' model" + if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then RET=1 fi @@ -464,15 +432,8 @@ if [ `grep -c "\"trace_file\":\"global_count.log\"" ./curl.out` != "1" ]; then fi # Set trace count -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_count":"5"}' localhost:8000/v2/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +update_global_trace_setting '{"trace_count":"5"}' +assert_curl_success "Failed to modify global trace settings" # Check if the current setting is returned if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then @@ -492,28 +453,12 @@ if [ `grep -c "\"trace_file\":\"global_count.log\"" ./curl.out` != "1" ]; then fi # Send requests to simple where trace is explicitly disabled -for p in {1..10}; do - $SIMPLE_HTTP_CLIENT >> client_update.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi +send_inference_requests "client_update.log" 10 - $SIMPLE_GRPC_CLIENT >> client_update.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi -done +# Check the current setting again and expect 'trace_count' becomes 0 +get_trace_setting "simple" +assert_curl_success "Failed to obtain trace settings for 'simple' model" -# Check the current setting agian and expect 'trace_count' becomes 0 -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then RET=1 fi @@ -536,6 +481,14 @@ if [ -f ./global_trace.log.0 ]; then RET=1 fi +SETTINGS="trace_count trace_rate log_frequency" + +for SETTING in $SETTINGS; do + # Check `out of range` errors + update_trace_setting "simple" '{"'${SETTING}'":"10000000000"}' + assert_curl_failure "Server modified '${SETTING}' with an out of range value." 
+done + set -e kill $SERVER_PID @@ -576,7 +529,7 @@ fi set -e # Test Python client library -SERVER_ARGS="--trace-file=global_unittest.log --trace-level=TIMESTAMPS --trace-rate=1 --model-repository=$MODELSDIR" +SERVER_ARGS="--trace-config triton,file=global_unittest.log --trace-config level=TIMESTAMPS --trace-config rate=1 --model-repository=$MODELSDIR" SERVER_LOG="./inference_server_unittest.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -607,11 +560,249 @@ set -e kill $SERVER_PID wait $SERVER_PID -if [ $RET -eq 0 ]; then - echo -e "\n***\n*** Test Passed\n***" -else - echo -e "\n***\n*** Test FAILED\n***" + +# Check `--trace-config` sets arguments properly +SERVER_ARGS="--trace-config=triton,file=bls_trace.log --trace-config=level=TIMESTAMPS \ + --trace-config=rate=4 --trace-config=count=6 --trace-config=mode=triton --model-repository=$MODELSDIR" +SERVER_LOG="./inference_server_trace_config.log" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +get_trace_setting "simple" +assert_curl_success "Failed to obtain trace settings for 'simple' model" + +if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then + RET=1 +fi +if [ `grep -c "\"trace_rate\":\"4\"" ./curl.out` != "1" ]; then + RET=1 +fi +if [ `grep -c "\"trace_count\":\"6\"" ./curl.out` != "1" ]; then + RET=1 +fi +if [ `grep -c "\"log_frequency\":\"0\"" ./curl.out` != "1" ]; then + RET=1 +fi +if [ `grep -c "\"trace_file\":\"bls_trace.log\"" ./curl.out` != "1" ]; then + RET=1 +fi + +set +e +# Send bls requests to make sure simple model is traced +for p in {1..4}; do + python -c 'import opentelemetry_unittest; \ + opentelemetry_unittest.send_bls_request(model_name="ensemble_add_sub_int32_int32_int32")' >> client_update.log 2>&1 +done + +set -e + +kill $SERVER_PID +wait $SERVER_PID + +set +e + +$TRACE_SUMMARY -t bls_trace.log > summary_bls.log + +if [ `grep -c "COMPUTE_INPUT_END" summary_bls.log` != "2" ]; then + cat summary_bls.log + echo -e "\n***\n*** Test Failed: Unexpected number of traced "COMPUTE_INPUT_END" events.\n***" + RET=1 +fi + +if [ `grep -c ^ensemble_add_sub_int32_int32_int32 summary_bls.log` != "1" ]; then + cat summary_bls.log + echo -e "\n***\n*** Test Failed: BLS child ensemble model wasn't traced. \n***" + RET=1 +fi + +if [ `grep -c ^simple summary_bls.log` != "1" ]; then + cat summary_bls.log + echo -e "\n***\n*** Test Failed: ensemble's model 'simple' wasn't traced. \n***" + RET=1 +fi + +if [ `grep -o 'parent_id' bls_trace.log | wc -l` != "2" ]; then + cat bls_trace.log + echo -e "\n***\n*** Test Failed: Unexpected number of 'parent id' fields. 
\n***" + RET=1 +fi + +# Attempt to trace non-existent model +SERVER_ARGS="--model-control-mode=explicit --model-repository=$MODELSDIR" +SERVER_LOG="./inference_server_nonexistent_model.log" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +# Explicitly load model +rm -f ./curl.out +set +e +code=`curl -s -w %{http_code} -o ./curl.out -X POST localhost:8000/v2/repository/models/simple/load` +set -e +assert_curl_success "Failed to load 'simple' model" + +# Non-existent model (get) +get_trace_setting "does-not-exist" +assert_curl_failure "Server returned trace settings for a non-existent model" + +# Non-existent model (post) +update_trace_setting "does-not-exist" '{"log_frequency":"1"}' +assert_curl_failure "Server modified trace settings for a non-existent model" + +# Local model (get) +get_trace_setting "simple" +assert_curl_success "Failed to obtain trace settings for 'simple' model" + +# Local model (post) +update_trace_setting "simple" '{"log_frequency":"1"}' +assert_curl_success "Failed to modify trace settings for 'simple' model" + +# Local model (unload) +rm -f ./curl.out +set +e +code=`curl -s -w %{http_code} -o ./curl.out -X POST localhost:8000/v2/repository/models/simple/unload` +set -e +assert_curl_success "Failed to unload 'simple' model" + +get_trace_setting "simple" +assert_curl_failure "Server returned trace settings for an unloaded model" + +update_trace_setting "simple" '{"log_frequency":"1"}' +assert_curl_failure "Server modified trace settings for an unloaded model" + +# Local model (reload) +rm -f ./curl.out +set +e +code=`curl -s -w %{http_code} -o ./curl.out -X POST localhost:8000/v2/repository/models/simple/load` +set -e +assert_curl_success "Failed to load 'simple' model" + +get_trace_setting "simple" +assert_curl_success "Failed to obtain trace settings for 'simple' model" + +update_trace_setting "simple" '{"log_frequency":"1"}' +assert_curl_success "Failed to modify trace settings for 'simple' model" + +kill $SERVER_PID +wait $SERVER_PID + +set +e + +# Check opentelemetry trace exporter sends proper info. +# A helper python script starts listening on $OTLP_PORT, where +# OTLP exporter sends traces. +export TRITON_OPENTELEMETRY_TEST='false' +OTLP_PORT=10000 +OTEL_COLLECTOR_DIR=./opentelemetry-collector +OTEL_COLLECTOR=./opentelemetry-collector/bin/otelcorecol_* +OTEL_COLLECTOR_LOG="./trace_collector_http_exporter.log" + +# Building the latest version of the OpenTelemetry collector. +# Ref: https://opentelemetry.io/docs/collector/getting-started/#local +if [ -d "$OTEL_COLLECTOR_DIR" ]; then rm -Rf $OTEL_COLLECTOR_DIR; fi +git clone --depth 1 --branch v0.82.0 https://github.com/open-telemetry/opentelemetry-collector.git +cd $OTEL_COLLECTOR_DIR +make install-tools +make otelcorecol +cd .. +$OTEL_COLLECTOR --config ./trace-config.yaml >> $OTEL_COLLECTOR_LOG 2>&1 & COLLECTOR_PID=$! + + +SERVER_ARGS="--trace-config=level=TIMESTAMPS --trace-config=rate=1 \ + --trace-config=count=100 --trace-config=mode=opentelemetry \ + --trace-config=opentelemetry,url=localhost:$OTLP_PORT/v1/traces \ + --model-repository=$MODELSDIR" +SERVER_LOG="./inference_server_otel_http_exporter.log" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +$SIMPLE_HTTP_CLIENT >>$CLIENT_LOG 2>&1 + +set -e + +kill $SERVER_PID +wait $SERVER_PID + +kill $COLLECTOR_PID +wait $COLLECTOR_PID + +set +e + +if ! 
[[ -s $OTEL_COLLECTOR_LOG && `grep -c 'InstrumentationScope triton-server' $OTEL_COLLECTOR_LOG` == 3 ]] ; then + echo -e "\n***\n*** HTTP exporter test failed.\n***" + cat $OTEL_COLLECTOR_LOG + exit 1 fi +# Unittests then check that produced spans have expected format and events +OPENTELEMETRY_TEST=opentelemetry_unittest.py +OPENTELEMETRY_LOG="opentelemetry_unittest.log" +EXPECTED_NUM_TESTS="3" + +export TRITON_OPENTELEMETRY_TEST='true' + +SERVER_ARGS="--trace-config=level=TIMESTAMPS --trace-config=rate=1 \ + --trace-config=count=100 --trace-config=mode=opentelemetry \ + --trace-config=opentelemetry,resource=test.key=test.value \ + --trace-config=opentelemetry,resource=service.name=test_triton \ + --model-repository=$MODELSDIR" +SERVER_LOG="./inference_server_otel_ostream_exporter.log" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +# Preparing traces for unittest. +# Note: running this separately, so that I could extract spans with `grep` +# from server log later. +python -c 'import opentelemetry_unittest; \ + opentelemetry_unittest.prepare_traces()' >>$CLIENT_LOG 2>&1 + +sleep 5 + +set -e + +kill $SERVER_PID +wait $SERVER_PID + +set +e + +grep -z -o -P '({\n(?s).*}\n)' $SERVER_LOG >> trace_collector.log + +if ! [ -s trace_collector.log ] ; then + echo -e "\n***\n*** $SERVER_LOG did not contain any OpenTelemetry spans.\n***" + exit 1 +fi + +# Unittest will not start until expected number of spans is collected. +python $OPENTELEMETRY_TEST >>$OPENTELEMETRY_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $OPENTELEMETRY_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $OPENTELEMETRY_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi + exit $RET diff --git a/qa/L0_trace/trace-config.yaml b/qa/L0_trace/trace-config.yaml new file mode 100644 index 0000000000..f8fe2424c0 --- /dev/null +++ b/qa/L0_trace/trace-config.yaml @@ -0,0 +1,45 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# Simple config file for OpenTelemetry collector. +# It receives all traces, received on localhost:10000 and prints +# it into the output stream. +# Ref: https://opentelemetry.io/docs/collector/configuration/ +receivers: + otlp: + protocols: + http: + endpoint: 0.0.0.0:10000 + +exporters: + logging: + verbosity: detailed + +service: + pipelines: + traces: + receivers: [otlp] + exporters: [logging] diff --git a/qa/L0_trace/trace_endpoint_test.py b/qa/L0_trace/trace_endpoint_test.py old mode 100644 new mode 100755 index c836e03e8f..70066dd3b2 --- a/qa/L0_trace/trace_endpoint_test.py +++ b/qa/L0_trace/trace_endpoint_test.py @@ -1,6 +1,6 @@ #!/usr/bin/python -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -27,21 +27,21 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -import numpy as np +import json import sys import unittest -import tritonclient.http as httpclient + +import test_util as tu import tritonclient.grpc as grpcclient -import json +import tritonclient.http as httpclient from google.protobuf import json_format -import test_util as tu # Similar set up as dynamic batcher tests class TraceEndpointTest(tu.TestResultCollector): - def tearDown(self): # Clear all trace settings to initial state. 
# Note that the tearDown function uses HTTP client so the pass/fail @@ -53,13 +53,13 @@ def tearDown(self): "trace_level": None, "trace_rate": None, "trace_count": None, - "log_frequency": None + "log_frequency": None, } triton_client = httpclient.InferenceServerClient("localhost:8000") - triton_client.update_trace_settings(model_name="simple", - settings=clear_settings) - triton_client.update_trace_settings(model_name=None, - settings=clear_settings) + triton_client.update_trace_settings( + model_name="simple", settings=clear_settings + ) + triton_client.update_trace_settings(model_name=None, settings=clear_settings) def check_server_initial_state(self): # Helper function to make sure the trace setting is properly @@ -72,11 +72,12 @@ def check_server_initial_state(self): "trace_level": ["TIMESTAMPS"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "0" + "log_frequency": "0", } triton_client = httpclient.InferenceServerClient("localhost:8000") - self.assertEqual(initial_settings, - triton_client.get_trace_settings(model_name="simple")) + self.assertEqual( + initial_settings, triton_client.get_trace_settings(model_name="simple") + ) self.assertEqual(initial_settings, triton_client.get_trace_settings()) def test_http_get_settings(self): @@ -87,46 +88,64 @@ def test_http_get_settings(self): "trace_level": ["TIMESTAMPS"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "0" + "log_frequency": "0", } triton_client = httpclient.InferenceServerClient("localhost:8000") - self.assertEqual(initial_settings, - triton_client.get_trace_settings(model_name="simple"), - "Unexpected initial model trace settings") - self.assertEqual(initial_settings, triton_client.get_trace_settings(), - "Unexpected initial global settings") + self.assertEqual( + initial_settings, + triton_client.get_trace_settings(model_name="simple"), + "Unexpected initial model trace settings", + ) + self.assertEqual( + initial_settings, + triton_client.get_trace_settings(), + "Unexpected initial global settings", + ) + try: + triton_client.get_trace_settings(model_name="does-not-exist") + except Exception as ex: + self.assertIn( + "Request for unknown model : does-not-exist", + ex.message(), + ) def test_grpc_get_settings(self): # Model trace settings will be the same as global trace settings since # no update has been made. 
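For context on the calls these endpoint tests exercise, here is a minimal sketch of querying trace settings over both protocols, assuming a local server with the `simple` model loaded on the default ports (8000 for HTTP, 8001 for gRPC). The HTTP client returns plain dictionaries while the gRPC client returns a `TraceSettingResponse` protobuf, which is why the expectations above are built differently for the two clients.

import tritonclient.grpc as grpcclient
import tritonclient.http as httpclient

# HTTP: trace settings come back as a plain dict, e.g. {"trace_rate": "1", ...}
http_client = httpclient.InferenceServerClient("localhost:8000")
print(http_client.get_trace_settings())                     # global settings
print(http_client.get_trace_settings(model_name="simple"))  # per-model settings

# gRPC: trace settings come back as a TraceSettingResponse protobuf message
grpc_client = grpcclient.InferenceServerClient("localhost:8001")
print(grpc_client.get_trace_settings(model_name="simple"))

# Asking for a model that is not loaded raises an exception whose message
# contains "Request for unknown model", which is what the new checks assert.
try:
    http_client.get_trace_settings(model_name="does-not-exist")
except Exception as ex:
    print(ex)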
initial_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["global_unittest.log"] - }, - "trace_level": { - "value": ["TIMESTAMPS"] - }, - "trace_rate": { - "value": ["1"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["0"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["global_unittest.log"]}, + "trace_level": {"value": ["TIMESTAMPS"]}, + "trace_rate": {"value": ["1"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["0"]}, + } } - }), initial_settings) + ), + initial_settings, + ) triton_client = grpcclient.InferenceServerClient("localhost:8001") - self.assertEqual(initial_settings, - triton_client.get_trace_settings(model_name="simple"), - "Unexpected initial model trace settings") - self.assertEqual(initial_settings, triton_client.get_trace_settings(), - "Unexpected initial global settings") + self.assertEqual( + initial_settings, + triton_client.get_trace_settings(model_name="simple"), + "Unexpected initial model trace settings", + ) + self.assertEqual( + initial_settings, + triton_client.get_trace_settings(), + "Unexpected initial global settings", + ) + try: + triton_client.get_trace_settings(model_name="does-not-exist") + except Exception as ex: + self.assertIn( + "Request for unknown model : does-not-exist", + ex.message(), + ) def test_http_update_settings(self): # Update model and global trace settings in order, @@ -139,47 +158,60 @@ def test_http_update_settings(self): "trace_level": ["TIMESTAMPS"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "0" + "log_frequency": "0", } expected_second_model_settings = { "trace_file": "model.log", "trace_level": ["TIMESTAMPS", "TENSORS"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "0" + "log_frequency": "0", } expected_global_settings = { "trace_file": "another.log", "trace_level": ["TIMESTAMPS", "TENSORS"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "0" + "log_frequency": "0", } model_update_settings = {"trace_file": "model.log"} global_update_settings = { "trace_file": "another.log", - "trace_level": ["TIMESTAMPS", "TENSORS"] + "trace_level": ["TIMESTAMPS", "TENSORS"], } triton_client = httpclient.InferenceServerClient("localhost:8000") self.assertEqual( expected_first_model_settings, - triton_client.update_trace_settings(model_name="simple", - settings=model_update_settings), - "Unexpected updated model trace settings") + triton_client.update_trace_settings( + model_name="simple", settings=model_update_settings + ), + "Unexpected updated model trace settings", + ) # Note that 'trace_level' may be mismatch due to the order of # the levels listed, currently we assume the order is the same # for simplicity. 
But the order shouldn't be enforced and this checking # needs to be improved when this kind of failure is reported self.assertEqual( expected_global_settings, + triton_client.update_trace_settings(settings=global_update_settings), + "Unexpected updated global settings", + ) + self.assertEqual( + expected_second_model_settings, + triton_client.get_trace_settings(model_name="simple"), + "Unexpected model trace settings after global update", + ) + try: triton_client.update_trace_settings( - settings=global_update_settings), - "Unexpected updated global settings") - self.assertEqual(expected_second_model_settings, - triton_client.get_trace_settings(model_name="simple"), - "Unexpected model trace settings after global update") + model_name="does-not-exist", settings=model_update_settings + ) + except Exception as ex: + self.assertIn( + "Request for unknown model : does-not-exist", + ex.message(), + ) def test_grpc_update_settings(self): # Update model and global trace settings in order, @@ -187,98 +219,91 @@ def test_grpc_update_settings(self): # the model setting fields that haven't been specified. self.check_server_initial_state() - expected_first_model_settings = grpcclient.service_pb2.TraceSettingResponse( - ) + expected_first_model_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["model.log"] - }, - "trace_level": { - "value": ["TIMESTAMPS"] - }, - "trace_rate": { - "value": ["1"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["0"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["model.log"]}, + "trace_level": {"value": ["TIMESTAMPS"]}, + "trace_rate": {"value": ["1"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["0"]}, + } } - }), expected_first_model_settings) - - expected_second_model_settings = grpcclient.service_pb2.TraceSettingResponse( + ), + expected_first_model_settings, ) + + expected_second_model_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["model.log"] - }, - "trace_level": { - "value": ["TIMESTAMPS", "TENSORS"] - }, - "trace_rate": { - "value": ["1"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["0"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["model.log"]}, + "trace_level": {"value": ["TIMESTAMPS", "TENSORS"]}, + "trace_rate": {"value": ["1"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["0"]}, + } } - }), expected_second_model_settings) + ), + expected_second_model_settings, + ) expected_global_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["another.log"] - }, - "trace_level": { - "value": ["TIMESTAMPS", "TENSORS"] - }, - "trace_rate": { - "value": ["1"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["0"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["another.log"]}, + "trace_level": {"value": ["TIMESTAMPS", "TENSORS"]}, + "trace_rate": {"value": ["1"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["0"]}, + } } - }), expected_global_settings) + ), + expected_global_settings, + ) model_update_settings = {"trace_file": "model.log"} global_update_settings = { "trace_file": "another.log", - "trace_level": ["TIMESTAMPS", "TENSORS"] + "trace_level": ["TIMESTAMPS", "TENSORS"], } triton_client = 
grpcclient.InferenceServerClient("localhost:8001") self.assertEqual( expected_first_model_settings, - triton_client.update_trace_settings(model_name="simple", - settings=model_update_settings), - "Unexpected updated model trace settings") + triton_client.update_trace_settings( + model_name="simple", settings=model_update_settings + ), + "Unexpected updated model trace settings", + ) # Note that 'trace_level' may be mismatch due to the order of # the levels listed, currently we assume the order is the same # for simplicity. But the order shouldn't be enforced and this checking # needs to be improved when this kind of failure is reported self.assertEqual( expected_global_settings, + triton_client.update_trace_settings(settings=global_update_settings), + "Unexpected updated global settings", + ) + self.assertEqual( + expected_second_model_settings, + triton_client.get_trace_settings(model_name="simple"), + "Unexpected model trace settings after global update", + ) + try: triton_client.update_trace_settings( - settings=global_update_settings), - "Unexpected updated global settings") - self.assertEqual(expected_second_model_settings, - triton_client.get_trace_settings(model_name="simple"), - "Unexpected model trace settings after global update") + model_name="does-not-exist", settings=model_update_settings + ) + except Exception as ex: + self.assertIn( + "Request for unknown model : does-not-exist", + ex.message(), + ) def test_http_clear_settings(self): # Clear global and model trace settings in order, @@ -290,37 +315,33 @@ def test_http_clear_settings(self): # model 'simple' has 'trace_rate' and 'log_frequency' specified # global has 'trace_level', 'trace_count' and 'trace_rate' specified triton_client = httpclient.InferenceServerClient("localhost:8000") - triton_client.update_trace_settings(model_name="simple", - settings={ - "trace_rate": "12", - "log_frequency": "34" - }) - triton_client.update_trace_settings(settings={ - "trace_rate": "56", - "trace_count": "78", - "trace_level": ["OFF"] - }) + triton_client.update_trace_settings( + model_name="simple", settings={"trace_rate": "12", "log_frequency": "34"} + ) + triton_client.update_trace_settings( + settings={"trace_rate": "56", "trace_count": "78", "trace_level": ["OFF"]} + ) expected_global_settings = { "trace_file": "global_unittest.log", "trace_level": ["OFF"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "0" + "log_frequency": "0", } expected_first_model_settings = { "trace_file": "global_unittest.log", "trace_level": ["OFF"], "trace_rate": "12", "trace_count": "-1", - "log_frequency": "34" + "log_frequency": "34", } expected_second_model_settings = { "trace_file": "global_unittest.log", "trace_level": ["OFF"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "34" + "log_frequency": "34", } global_clear_settings = {"trace_rate": None, "trace_count": None} model_clear_settings = {"trace_rate": None, "trace_level": None} @@ -329,18 +350,25 @@ def test_http_clear_settings(self): self.assertEqual( expected_global_settings, triton_client.update_trace_settings(settings=global_clear_settings), - "Unexpected cleared global trace settings") - self.assertEqual(expected_first_model_settings, - triton_client.get_trace_settings(model_name="simple"), - "Unexpected model trace settings after global clear") + "Unexpected cleared global trace settings", + ) + self.assertEqual( + expected_first_model_settings, + triton_client.get_trace_settings(model_name="simple"), + "Unexpected model trace settings after global clear", + ) 
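As a side note on the clear-settings semantics the remaining assertions rely on: passing `None` for a field in `update_trace_settings` clears that field, so a model-level override falls back to the current global value (and a global field falls back to its default). A rough sketch, again assuming a local server with the `simple` model loaded:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

# Give the 'simple' model its own trace rate...
client.update_trace_settings(model_name="simple", settings={"trace_rate": "12"})

# ...then clear it: a None value removes the model-level override, so the
# returned settings show "trace_rate" falling back to the global value.
cleared = client.update_trace_settings(
    model_name="simple", settings={"trace_rate": None}
)
print(cleared["trace_rate"])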
self.assertEqual( expected_second_model_settings, - triton_client.update_trace_settings(model_name="simple", - settings=model_clear_settings), - "Unexpected model trace settings after model clear") - self.assertEqual(expected_global_settings, - triton_client.get_trace_settings(), - "Unexpected global trace settings after model clear") + triton_client.update_trace_settings( + model_name="simple", settings=model_clear_settings + ), + "Unexpected model trace settings after model clear", + ) + self.assertEqual( + expected_global_settings, + triton_client.get_trace_settings(), + "Unexpected global trace settings after model clear", + ) def test_grpc_clear_settings(self): # Clear global and model trace settings in order, @@ -352,82 +380,58 @@ def test_grpc_clear_settings(self): # model 'simple' has 'trace_rate' and 'log_frequency' specified # global has 'trace_level', 'trace_count' and 'trace_rate' specified triton_client = grpcclient.InferenceServerClient("localhost:8001") - triton_client.update_trace_settings(model_name="simple", - settings={ - "trace_rate": "12", - "log_frequency": "34" - }) - triton_client.update_trace_settings(settings={ - "trace_rate": "56", - "trace_count": "78", - "trace_level": ["OFF"] - }) + triton_client.update_trace_settings( + model_name="simple", settings={"trace_rate": "12", "log_frequency": "34"} + ) + triton_client.update_trace_settings( + settings={"trace_rate": "56", "trace_count": "78", "trace_level": ["OFF"]} + ) expected_global_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["global_unittest.log"] - }, - "trace_level": { - "value": ["OFF"] - }, - "trace_rate": { - "value": ["1"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["0"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["global_unittest.log"]}, + "trace_level": {"value": ["OFF"]}, + "trace_rate": {"value": ["1"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["0"]}, + } } - }), expected_global_settings) - expected_first_model_settings = grpcclient.service_pb2.TraceSettingResponse( + ), + expected_global_settings, ) + expected_first_model_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["global_unittest.log"] - }, - "trace_level": { - "value": ["OFF"] - }, - "trace_rate": { - "value": ["12"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["34"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["global_unittest.log"]}, + "trace_level": {"value": ["OFF"]}, + "trace_rate": {"value": ["12"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["34"]}, + } } - }), expected_first_model_settings) - expected_second_model_settings = grpcclient.service_pb2.TraceSettingResponse( + ), + expected_first_model_settings, ) + expected_second_model_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["global_unittest.log"] - }, - "trace_level": { - "value": ["OFF"] - }, - "trace_rate": { - "value": ["1"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["34"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["global_unittest.log"]}, + "trace_level": {"value": ["OFF"]}, + "trace_rate": {"value": ["1"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["34"]}, + } } - }), 
expected_second_model_settings) + ), + expected_second_model_settings, + ) global_clear_settings = {"trace_rate": None, "trace_count": None} model_clear_settings = {"trace_rate": None, "trace_level": None} @@ -436,19 +440,26 @@ def test_grpc_clear_settings(self): self.assertEqual( expected_global_settings, triton_client.update_trace_settings(settings=global_clear_settings), - "Unexpected cleared global trace settings") - self.assertEqual(expected_first_model_settings, - triton_client.get_trace_settings(model_name="simple"), - "Unexpected model trace settings after global clear") + "Unexpected cleared global trace settings", + ) + self.assertEqual( + expected_first_model_settings, + triton_client.get_trace_settings(model_name="simple"), + "Unexpected model trace settings after global clear", + ) self.assertEqual( expected_second_model_settings, - triton_client.update_trace_settings(model_name="simple", - settings=model_clear_settings), - "Unexpected model trace settings after model clear") - self.assertEqual(expected_global_settings, - triton_client.get_trace_settings(), - "Unexpected global trace settings after model clear") + triton_client.update_trace_settings( + model_name="simple", settings=model_clear_settings + ), + "Unexpected model trace settings after model clear", + ) + self.assertEqual( + expected_global_settings, + triton_client.get_trace_settings(), + "Unexpected global trace settings after model clear", + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_triton_repo_agent/test.sh b/qa/L0_triton_repo_agent/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_trt_compat/test.sh b/qa/L0_trt_compat/test.sh new file mode 100755 index 0000000000..6b4f83cbc8 --- /dev/null +++ b/qa/L0_trt_compat/test.sh @@ -0,0 +1,110 @@ +#!/bin/bash +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi + +TEST_RESULT_FILE='test_results.txt' +COMPATIBILITY_TEST_PY=trt_compatibility_test.py +CLIENT_LOG="client.log" +DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"} +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_ARGS="--model-repository=`pwd`/models --exit-timeout-secs=120" +SERVER_LOG="./inference_server.log" +source ../common/util.sh + +rm -fr models && mkdir models +cp -r $DATADIR/qa_identity_model_repository/plan_compatible_zero_1_float32 models/. + +RET=0 + +if [ `ps | grep -c "tritonserver"` != "0" ]; then + echo -e "Tritonserver already running" + echo -e `ps | grep tritonserver` + exit 1 +fi + +run_server +if [ "$SERVER_PID" != "0" ]; then + cat $SERVER_LOG + echo -e "\n***\n*** FAILED: unexpected server start (version compatibility disabled): $SERVER\n***" >> $CLIENT_LOG + kill $SERVER_PID + wait $SERVER_PID + exit 1 +fi + +EXPECTED_ERR="Internal Error (Cannot deserialize engine with lean runtime" +if ! grep "$EXPECTED_ERR" $SERVER_LOG; then + cat $SERVER_LOG + echo -e "\n***\n*** Failed to find expected error: ${EXPECTED_ERR} \n***" + RET=1 +fi + +SERVER_ARGS="--model-repository=`pwd`/models --exit-timeout-secs=120 --backend-config=tensorrt,version-compatible=true" + +run_server +if [ "$SERVER_PID" == "0" ]; then + cat $SERVER_LOG + echo -e "\n***\n*** FAILED: unsuccessful server start (version compatibility enabled): $SERVER\n***" + exit 1 +fi + +set +e + +python $COMPATIBILITY_TEST_PY >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi + +set -e + +kill $SERVER_PID +wait $SERVER_PID + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_trt_compat/trt_compatibility_test.py b/qa/L0_trt_compat/trt_compatibility_test.py new file mode 100755 index 0000000000..6991299a4c --- /dev/null +++ b/qa/L0_trt_compat/trt_compatibility_test.py @@ -0,0 +1,50 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import unittest + +import infer_util as iu +import numpy as np +import test_util as tu + + +class TrtCompatibilityTest(tu.TestResultCollector): + def setUp(self): + self._data_type = np.float32 + + def test_plan(self): + # plan_compatible_zero_1_float32 is an identity model with input shape [-1] + iu.infer_zero(self, "plan_compatible", 1, self._data_type, [[2, 4]], [[2, 4]]) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_trt_data_dependent_shape/test.sh b/qa/L0_trt_data_dependent_shape/test.sh new file mode 100755 index 0000000000..61efb053f8 --- /dev/null +++ b/qa/L0_trt_data_dependent_shape/test.sh @@ -0,0 +1,94 @@ +#!/bin/bash +# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! 
-z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +TEST_RESULT_FILE='test_results.txt' +export CUDA_VISIBLE_DEVICES=0 + +TRT_TEST=trt_data_dependent_shape_test.py + +DATADIR="./models" + +rm -rf ${DATADIR} +cp -r /data/inferenceserver/${REPO_VERSION}/qa_trt_data_dependent_model_repository/ ${DATADIR} + +source ../common/util.sh + +rm -f *.log* + +RET=0 + +CLIENT_LOG="./client.log" +SERVER_LOG="./inference_server.log" +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_ARGS="--model-repository=$DATADIR" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python $TRT_TEST >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Test Failed\n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE 2 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test Failed\n***" +fi + +exit $RET diff --git a/qa/L0_trt_data_dependent_shape/trt_data_dependent_shape_test.py b/qa/L0_trt_data_dependent_shape/trt_data_dependent_shape_test.py new file mode 100755 index 0000000000..ee0b675d84 --- /dev/null +++ b/qa/L0_trt_data_dependent_shape/trt_data_dependent_shape_test.py @@ -0,0 +1,85 @@ +#!/usr/bin/env python3 + +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
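The L0_trt_data_dependent_shape run above feeds models whose output shape depends on the input values, not just the input shape, which is what the unit test below verifies against `np.nonzero`. A small numpy-only illustration of that property (no Triton involved):

import numpy as np

# Two inputs with the same shape produce differently shaped nonzero results,
# because the number of nonzero elements depends on the values themselves.
a = np.array([[1, 0], [0, 2]], dtype=np.int32)
b = np.array([[1, 3], [5, 2]], dtype=np.int32)

print(np.nonzero(a))  # index arrays for 2 nonzero elements
print(np.nonzero(b))  # index arrays for 4 nonzero elements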
+ +import sys + +sys.path.append("../common") + +import unittest + +import numpy as np +import test_util as tu +import tritonclient.http as client + + +class TrtDataDependentShapeTest(tu.TestResultCollector): + def setUp(self): + self.triton_client = client.InferenceServerClient( + "localhost:8000", verbose=True + ) + + def test_fixed(self): + model_name = "plan_nobatch_nonzero_fixed" + input_np = np.arange(16, dtype=np.int32).reshape((4, 4)) + expected_output_np = np.nonzero(input_np) + + inputs = [] + inputs.append(client.InferInput("INPUT", [4, 4], "INT32")) + inputs[-1].set_data_from_numpy(input_np) + + results = self.triton_client.infer(model_name=model_name, inputs=inputs) + # Validate the results by comparing with precomputed values. + output_np = results.as_numpy("OUTPUT") + self.assertTrue( + np.array_equal(output_np, expected_output_np), + "OUTPUT expected: {}, got {}".format(expected_output_np, output_np), + ) + + def test_dynamic(self): + model_name = "plan_nobatch_nonzero_dynamic" + input_data = [] + for i in range(20 * 16): + input_data.append(i if (i % 2) == 0 else 0) + input_np = np.array(input_data, dtype=np.int32).reshape((20, 16)) + expected_output_np = np.nonzero(input_np) + + inputs = [] + inputs.append(client.InferInput("INPUT", [20, 16], "INT32")) + inputs[-1].set_data_from_numpy(input_np) + + results = self.triton_client.infer(model_name=model_name, inputs=inputs) + # Validate the results by comparing with precomputed values. + output_np = results.as_numpy("OUTPUT") + self.assertTrue( + np.array_equal(output_np, expected_output_np), + "OUTPUT expected: {}, got {}".format(expected_output_np, output_np), + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_trt_dla/dla_test.py b/qa/L0_trt_dla/dla_test.py old mode 100644 new mode 100755 index ec4f687c47..d71d277ac4 --- a/qa/L0_trt_dla/dla_test.py +++ b/qa/L0_trt_dla/dla_test.py @@ -1,5 +1,5 @@ #!/usr/bin/env python -# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,26 +26,25 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") import unittest + import numpy as np -from PIL import Image import test_util as tu - import tritonclient.http as httpclient -from tritonclient.utils import InferenceServerException +from PIL import Image class InferTest(tu.TestResultCollector): - def _preprocess(self, img, dtype): """ Pre-process an image to meet the size and type requirements specified by the parameters. 
""" - sample_img = img.convert('RGB') + sample_img = img.convert("RGB") resized_img = sample_img.resize((224, 224), Image.BILINEAR) resized = np.array(resized_img) @@ -57,8 +56,7 @@ def _preprocess(self, img, dtype): def test_resnet50(self): try: - triton_client = httpclient.InferenceServerClient( - url="localhost:8000") + triton_client = httpclient.InferenceServerClient(url="localhost:8000") except Exception as e: print("channel creation failed: " + str(e)) sys.exit(1) @@ -74,22 +72,21 @@ def test_resnet50(self): batched_image_data = image_data for i in range(1, batch_size): batched_image_data = np.concatenate( - (batched_image_data, image_data), axis=0) + (batched_image_data, image_data), axis=0 + ) inputs = [ - httpclient.InferInput('input_tensor_0', [batch_size, 3, 224, 224], - 'INT8') + httpclient.InferInput("input_tensor_0", [batch_size, 3, 224, 224], "INT8") ] inputs[0].set_data_from_numpy(batched_image_data, binary_data=True) outputs = [ - httpclient.InferRequestedOutput('topk_layer_output_index', - binary_data=True) + httpclient.InferRequestedOutput("topk_layer_output_index", binary_data=True) ] results = triton_client.infer(model_name, inputs, outputs=outputs) - output_data = results.as_numpy('topk_layer_output_index') + output_data = results.as_numpy("topk_layer_output_index") print(output_data) # Validate the results by comparing with precomputed values. @@ -99,5 +96,5 @@ def test_resnet50(self): self.assertEqual(output_data[i][0][0], EXPECTED_CLASS_INDEX) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_trt_dla/test.sh b/qa/L0_trt_dla/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_trt_dynamic_shape/test.sh b/qa/L0_trt_dynamic_shape/test.sh index 99ecc7f2b8..43a39dd199 100755 --- a/qa/L0_trt_dynamic_shape/test.sh +++ b/qa/L0_trt_dynamic_shape/test.sh @@ -305,7 +305,7 @@ kill $SERVER_PID wait $SERVER_PID -# Adding test cases for mulitple optimization profiles with static shapes. +# Adding test cases for multiple optimization profiles with static shapes. # Will load only the following profiles with the static shapes: # Profile 7: [1, 33] # Profile 8: [3, 33] diff --git a/qa/L0_trt_dynamic_shape/trt_dynamic_shape_test.py b/qa/L0_trt_dynamic_shape/trt_dynamic_shape_test.py old mode 100644 new mode 100755 index 8b01cbc206..d9f890d9b6 --- a/qa/L0_trt_dynamic_shape/trt_dynamic_shape_test.py +++ b/qa/L0_trt_dynamic_shape/trt_dynamic_shape_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,41 +27,52 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") -from builtins import range -from future.utils import iteritems -import os -import shutil -import time import unittest -import numpy as np + import infer_util as iu +import numpy as np import test_util as tu import tritonhttpclient from tritonclientutils import InferenceServerException class TrtDynamicShapeTest(tu.TestResultCollector): - def setUp(self): self.dtype_ = np.float32 - self.model_name_ = 'plan' + self.model_name_ = "plan" def test_load_specific_optimization_profile(self): # Only OP 5 should be available, which only allow batch size 8 tensor_shape = (1,) try: - iu.infer_exact(self, self.model_name_, (1,) + tensor_shape, 1, - self.dtype_, self.dtype_, self.dtype_) + iu.infer_exact( + self, + self.model_name_, + (1,) + tensor_shape, + 1, + self.dtype_, + self.dtype_, + self.dtype_, + ) except InferenceServerException as ex: self.assertTrue( "model expected the shape of dimension 0 to be between 6 and 8 but received 1" - in ex.message()) + in ex.message() + ) try: - iu.infer_exact(self, self.model_name_, (8,) + tensor_shape, 8, - self.dtype_, self.dtype_, self.dtype_) + iu.infer_exact( + self, + self.model_name_, + (8,) + tensor_shape, + 8, + self.dtype_, + self.dtype_, + self.dtype_, + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -68,37 +81,60 @@ def test_load_default_optimization_profile(self): tensor_shape = (33,) try: - iu.infer_exact(self, self.model_name_, (8,) + tensor_shape, 8, - self.dtype_, self.dtype_, self.dtype_) + iu.infer_exact( + self, + self.model_name_, + (8,) + tensor_shape, + 8, + self.dtype_, + self.dtype_, + self.dtype_, + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) over_tensor_shape = (34,) try: - iu.infer_exact(self, self.model_name_, (8,) + over_tensor_shape, 8, - self.dtype_, self.dtype_, self.dtype_) + iu.infer_exact( + self, + self.model_name_, + (8,) + over_tensor_shape, + 8, + self.dtype_, + self.dtype_, + self.dtype_, + ) except InferenceServerException as ex: self.assertTrue( "model expected the shape of dimension 1 to be between 1 and 33 but received 34" - in ex.message()) + in ex.message() + ) def test_select_optimization_profile(self): # Different profile has different optimized input shape batch_size = 4 tensor_shape = (16,) try: - iu.infer_exact(self, self.model_name_, (batch_size,) + tensor_shape, - batch_size, self.dtype_, self.dtype_, self.dtype_) + iu.infer_exact( + self, + self.model_name_, + (batch_size,) + tensor_shape, + batch_size, + self.dtype_, + self.dtype_, + self.dtype_, + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) def test_load_wrong_optimization_profile(self): client = tritonhttpclient.InferenceServerClient("localhost:8000") - model_name = tu.get_model_name(self.model_name_, self.dtype_, - self.dtype_, self.dtype_) + model_name = tu.get_model_name( + self.model_name_, self.dtype_, self.dtype_, self.dtype_ + ) model_status = client.is_model_ready(model_name, "1") self.assertFalse(model_status, "expected model to be not ready") -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_trt_error_propagation/test.sh b/qa/L0_trt_error_propagation/test.sh new file mode 100755 index 0000000000..dac3f6349e --- /dev/null +++ b/qa/L0_trt_error_propagation/test.sh @@ -0,0 +1,82 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +export CUDA_VISIBLE_DEVICES=0 +SERVER=/opt/tritonserver/bin/tritonserver +source ../common/util.sh + +# Create TensorRT model with invalid plan file +rm -rf models && mkdir models +mkdir models/invalid_plan_file && (cd models/invalid_plan_file && \ + echo -e "name: \"invalid_plan_file\"" >> config.pbtxt && \ + echo -e "platform: \"tensorrt_plan\"" >> config.pbtxt && \ + echo -e "input [\n {\n name: \"INPUT\"\n data_type: TYPE_FP32\n dims: [-1]\n }\n ]" >> config.pbtxt && \ + echo -e "output [\n {\n name: \"OUTPUT\"\n data_type: TYPE_FP32\n dims: [-1]\n }\n ]" >> config.pbtxt && \ + mkdir 1 && echo "----- invalid model.plan -----" >> 1/model.plan) + +# Test with and without auto complete enabled +for ENABLE_AUTOCOMPLETE in "YES" "NO"; do + + if [[ "$ENABLE_AUTOCOMPLETE" == "YES" ]]; then + TEST_NAME="test_invalid_trt_model_autocomplete" + SERVER_ARGS="--model-repository=models --model-control-mode=explicit" + else + TEST_NAME="test_invalid_trt_model" + SERVER_ARGS="--model-repository=models --model-control-mode=explicit --disable-auto-complete-config" + fi + + SERVER_LOG="./$TEST_NAME.server.log" + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + RET=0 + + set +e + python trt_error_propagation_test.py TestTrtErrorPropagation.$TEST_NAME > $TEST_NAME.log 2>&1 + if [ $? -ne 0 ]; then + cat $TEST_NAME.log + echo -e "\n***\n*** Test FAILED\n***" + RET=1 + fi + set -e + + kill $SERVER_PID + wait $SERVER_PID + + if [ $RET -ne 0 ]; then + exit $RET + fi + +done + +# Exit with success +echo -e "\n***\n*** Test Passed\n***" +exit 0 diff --git a/qa/L0_trt_error_propagation/trt_error_propagation_test.py b/qa/L0_trt_error_propagation/trt_error_propagation_test.py new file mode 100755 index 0000000000..83527a7533 --- /dev/null +++ b/qa/L0_trt_error_propagation/trt_error_propagation_test.py @@ -0,0 +1,72 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import unittest + +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException + + +class TestTrtErrorPropagation(unittest.TestCase): + def setUp(self): + # Initialize client + self.__triton = grpcclient.InferenceServerClient("localhost:8001", verbose=True) + + def test_invalid_trt_model(self): + with self.assertRaises(InferenceServerException) as cm: + self.__triton.load_model("invalid_plan_file") + err_msg = str(cm.exception) + # All 'expected_msg_parts' should be present in the 'err_msg' in order + expected_msg_parts = [ + "load failed for model", + "version 1 is at UNAVAILABLE state: ", + "Internal: unable to create TensorRT engine: ", + "Error Code ", + "Internal Error ", + ] + for expected_msg_part in expected_msg_parts: + self.assertIn( + expected_msg_part, + err_msg, + "Cannot find an expected part of error message", + ) + _, err_msg = err_msg.split(expected_msg_part) + + def test_invalid_trt_model_autocomplete(self): + with self.assertRaises(InferenceServerException) as cm: + self.__triton.load_model("invalid_plan_file") + err_msg = str(cm.exception) + self.assertIn( + "Internal: unable to load plan file to auto complete config", + err_msg, + "Caught an unexpected exception", + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_trt_plugin/test.sh b/qa/L0_trt_plugin/test.sh old mode 100644 new mode 100755 index 13df59c56f..7ffc7e215d --- a/qa/L0_trt_plugin/test.sh +++ b/qa/L0_trt_plugin/test.sh @@ -43,18 +43,112 @@ export CUDA_VISIBLE_DEVICES=0 CLIENT_LOG="./client.log" PLUGIN_TEST=trt_plugin_test.py -EXPECTED_NUM_TESTS="2" -DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_trt_plugin_model_repository +# On windows the paths invoked by the script (running in WSL) must use +# /mnt/c when needed but the paths on the tritonserver command-line +# must be C:/ style. 
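The comment above summarizes the path convention the plugin test relies on under WSL: files are accessed through `/mnt/c/...` paths while the tritonserver command line takes `C:/`-style paths. Purely as an illustration of that mapping (this helper is hypothetical and not used by the script):

def wsl_to_windows(path):
    """Map a WSL mount path such as /mnt/c/models to Windows-style C:/models."""
    prefix = "/mnt/"
    if path.startswith(prefix) and len(path) > len(prefix):
        drive = path[len(prefix)]
        rest = path[len(prefix) + 1 :]
        return drive.upper() + ":" + (rest if rest else "/")
    return path


print(wsl_to_windows("/mnt/c/data/inferenceserver"))  # C:/data/inferenceserver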
+if [[ "$(< /proc/sys/kernel/osrelease)" == *microsoft* ]]; then + DATADIR=${DATADIR:="/mnt/c/data/inferenceserver/${REPO_VERSION}"} + MODELDIR=${MODELDIR:=C:/models} + CUSTOMPLUGIN=${CUSTOMPLUGIN:=$MODELDIR/clipplugin.dll} + BACKEND_DIR=${BACKEND_DIR:=C:/tritonserver/backends} + SERVER=${SERVER:=/mnt/c/tritonserver/bin/tritonserver.exe} +else + DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"} + MODELDIR=${MODELDIR:=`pwd`/models} + CUSTOMPLUGIN=${CUSTOMPLUGIN:=$MODELDIR/libclipplugin.so} + TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} + BACKEND_DIR=${TRITON_DIR}/backends + SERVER=${TRITON_DIR}/bin/tritonserver +fi -SERVER=/opt/tritonserver/bin/tritonserver -SERVER_ARGS="--model-repository=$DATADIR --exit-timeout-secs=120" -SERVER_LOG="./inference_server.log" source ../common/util.sh -rm -f $SERVER_LOG $CLIENT_LOG - RET=0 +rm -f ./*.log + +SERVER_ARGS_BASE="--model-repository=${MODELDIR} --backend-directory=${BACKEND_DIR} --log-verbose=1" +SERVER_TIMEOUT=20 + +LOG_IDX=0 + +## Default Plugin Tests + +## Create model folder with default plugin models +rm -fr models && mkdir -p models +set -e +find $DATADIR/qa_trt_plugin_model_repository/ -mindepth 1 -maxdepth 1 ! -iname '*clipplugin*' -exec cp -rv {} models \; + +SERVER_ARGS=$SERVER_ARGS_BASE +SERVER_LOG="./inference_server_$LOG_IDX.log" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +rm -f $CLIENT_LOG +set +e +python3 $PLUGIN_TEST PluginModelTest.test_raw_fff_gelu >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +rm -f $CLIENT_LOG +python3 $PLUGIN_TEST PluginModelTest.test_raw_fff_norm >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill_server + +## Custom Plugin Tests + +## Create model folder with custom plugin models for remaining tests +rm -fr models && mkdir -p models +find $DATADIR/qa_trt_plugin_model_repository/ -maxdepth 1 -iname '*clipplugin*' -exec cp -r {} models \; + +LOG_IDX=$((LOG_IDX+1)) + +## Baseline Failure Test +## Plugin library not loaded +SERVER_ARGS=$SERVER_ARGS_BASE +SERVER_LOG="./inference_server_$LOG_IDX.log" + +run_server +if [ "$SERVER_PID" != "0" ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Test Failed\n" + echo -e "Unexpected successful server start $SERVER\n***" + kill_server + exit 1 +fi + +LOG_IDX=$((LOG_IDX+1)) + +## Backend Config, Single Plugin Test +SERVER_ARGS="${SERVER_ARGS_BASE} --backend-config=tensorrt,plugins=${CUSTOMPLUGIN}" +SERVER_LOG="./inference_server_$LOG_IDX.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -63,14 +157,15 @@ if [ "$SERVER_PID" == "0" ]; then exit 1 fi +rm -f $CLIENT_LOG set +e -python $PLUGIN_TEST >$CLIENT_LOG 2>&1 +python3 $PLUGIN_TEST PluginModelTest.test_raw_fff_clip >>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 else - check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + check_test_results $TEST_RESULT_FILE 1 if [ $? 
-ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Result Verification Failed\n***" @@ -79,13 +174,80 @@ else fi set -e -kill $SERVER_PID -wait $SERVER_PID +kill_server + +LOG_IDX=$((LOG_IDX+1)) + +## Backend Config, Multiple Plugins Test +SERVER_ARGS="${SERVER_ARGS_BASE} --backend-config=tensorrt,plugins=${CUSTOMPLUGIN}" +SERVER_LOG="./inference_server_$LOG_IDX.log" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +rm -f $CLIENT_LOG +set +e +python3 $PLUGIN_TEST PluginModelTest.test_raw_fff_clip >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill_server + +LOG_IDX=$((LOG_IDX+1)) + +## LD_PRELOAD, Single Plugin Test +## LD_PRELOAD is only on Linux + +SERVER_LD_PRELOAD=$CUSTOMPLUGIN +SERVER_ARGS=$SERVER_ARGS_BASE +SERVER_LOG="./inference_server_$LOG_IDX.log" + +if [[ "$(< /proc/sys/kernel/osrelease)" != *microsoft* ]]; then + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + rm -f $CLIENT_LOG + set +e + python3 $PLUGIN_TEST PluginModelTest.test_raw_fff_clip >>$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi + set -e + + kill_server +fi if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else - cat $CLIENT_LOG echo -e "\n***\n*** Test FAILED\n***" fi diff --git a/qa/L0_trt_plugin/trt_plugin_test.py b/qa/L0_trt_plugin/trt_plugin_test.py old mode 100644 new mode 100755 index fde88244a9..5dcc6318f5 --- a/qa/L0_trt_plugin/trt_plugin_test.py +++ b/qa/L0_trt_plugin/trt_plugin_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2018-2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,65 +27,98 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
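The plugin test below checks `CustomGeluPluginDynamic` against the common tanh-based GELU approximation. As a standalone sanity sketch (not part of the test), that approximation can be compared with the exact erf-based GELU; the sample grid and tolerance here are illustrative:

import math

import numpy as np

x = np.linspace(-3.0, 3.0, 7)

# Tanh-based approximation used by the plugin test:
#   gelu(x) ~ 0.5 * x * (1 + tanh(0.797885 * x + 0.035677 * x**3))
approx = 0.5 * x * (1.0 + np.tanh(0.797885 * x + 0.035677 * x**3))

# Exact GELU is x * Phi(x), with the standard normal CDF written via erf.
exact = 0.5 * x * (1.0 + np.array([math.erf(v / math.sqrt(2.0)) for v in x]))

# The approximation tracks the exact value to within roughly 1e-3 on this range.
np.testing.assert_allclose(approx, exact, atol=1e-3)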
import sys + sys.path.append("../common") -from builtins import range -from future.utils import iteritems +import os import unittest + import numpy as np -import os import test_util as tu +import tritonclient.http as httpclient -import tritonhttpclient as httpclient -from tritonclientutils import InferenceServerException +# By default, find tritonserver on "localhost", but can be overridden +# with TRITONSERVER_IPADDR envvar +_tritonserver_ipaddr = os.environ.get("TRITONSERVER_IPADDR", "localhost") class PluginModelTest(tu.TestResultCollector): - def _full_exact(self, model_name, plugin_name, shape): - triton_client = httpclient.InferenceServerClient("localhost:8000", - verbose=True) + print(f"{_tritonserver_ipaddr}:8000") + triton_client = httpclient.InferenceServerClient(f"{_tritonserver_ipaddr}:8000") inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', list(shape), "FP32")) + inputs.append(httpclient.InferInput("INPUT0", list(shape), "FP32")) input0_data = np.ones(shape=shape).astype(np.float32) inputs[0].set_data_from_numpy(input0_data, binary_data=True) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) - - results = triton_client.infer(model_name + '_' + plugin_name, - inputs, - outputs=outputs) - - output0_data = results.as_numpy('OUTPUT0') - - # Verify values of Normalize and GELU - if plugin_name == 'CustomGeluPluginDynamic': + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)) + + results = triton_client.infer( + model_name + "_" + plugin_name, inputs, outputs=outputs + ) + + output0_data = results.as_numpy("OUTPUT0") + tolerance_relative = 1e-6 + tolerance_absolute = 1e-7 + + # Verify values of Clip, GELU, and Normalize + if plugin_name == "CustomClipPlugin": + # Clip data to minimum of .1, maximum of .5 + test_output = np.clip(input0_data, 0.1, 0.5) + np.testing.assert_allclose( + output0_data, + test_output, + rtol=tolerance_relative, + atol=tolerance_absolute, + ) + elif plugin_name == "CustomGeluPluginDynamic": # Add bias input0_data += 1 # Calculate Gelu activation - test_output = (input0_data * - 0.5) * (1 + np.tanh((0.797885 * input0_data) + - (0.035677 * (input0_data**3)))) - self.assertTrue(np.isclose(output0_data, test_output).all()) - else: + test_output = (input0_data * 0.5) * ( + 1 + np.tanh((0.797885 * input0_data) + (0.035677 * (input0_data**3))) + ) + np.testing.assert_allclose( + output0_data, + test_output, + rtol=tolerance_relative, + atol=tolerance_absolute, + ) + elif plugin_name == "Normalize_TRT": # L2 norm is sqrt(sum([1]*16))) test_output = input0_data / np.sqrt(sum([1] * 16)) - self.assertTrue(np.isclose(output0_data, test_output).all()) + np.testing.assert_allclose( + output0_data, + test_output, + rtol=tolerance_relative, + atol=tolerance_absolute, + ) + else: + self.fail("Unexpected plugin: " + plugin_name) + + def test_raw_fff_clip(self): + for bs in (1, 8): + self._full_exact( + "plan_float32_float32_float32", "CustomClipPlugin", (bs, 16) + ) def test_raw_fff_gelu(self): - self._full_exact('plan_nobatch_float32_float32_float32', - 'CustomGeluPluginDynamic', (16, 1, 1)) + self._full_exact( + "plan_nobatch_float32_float32_float32", + "CustomGeluPluginDynamic", + (16, 1, 1), + ) def test_raw_fff_norm(self): # model that supports batching for bs in (1, 8): - self._full_exact('plan_float32_float32_float32', 'Normalize_TRT', - (bs, 16, 16, 16)) + self._full_exact( + "plan_float32_float32_float32", "Normalize_TRT", (bs, 16, 16, 16) + ) -if __name__ == '__main__': +if __name__ == 
"__main__": unittest.main() diff --git a/qa/L0_trt_reformat_free/test.sh b/qa/L0_trt_reformat_free/test.sh index c834a05992..ebdc83a5b8 100755 --- a/qa/L0_trt_reformat_free/test.sh +++ b/qa/L0_trt_reformat_free/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -42,7 +42,6 @@ TEST_RESULT_FILE='test_results.txt' export CUDA_VISIBLE_DEVICES=0 CLIENT_LOG="./client.log" -PERF_CLIENT=../clients/perf_client TRT_TEST=trt_reformat_free_test.py DATADIR="./models" diff --git a/qa/L0_trt_reformat_free/trt_reformat_free_test.py b/qa/L0_trt_reformat_free/trt_reformat_free_test.py old mode 100644 new mode 100755 index fedcf62184..ea36f9c24a --- a/qa/L0_trt_reformat_free/trt_reformat_free_test.py +++ b/qa/L0_trt_reformat_free/trt_reformat_free_test.py @@ -1,4 +1,6 @@ -# Copyright 2020-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,20 +27,16 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -from builtins import range -from future.utils import iteritems -import os -import shutil -import time import unittest +from builtins import range + import numpy as np -import infer_util as iu import test_util as tu -import tritonhttpclient import tritonclient.utils.shared_memory as shm -from tritonclientutils import InferenceServerException +import tritonhttpclient def div_up(a, b): @@ -48,36 +46,40 @@ def div_up(a, b): def reformat(format, tensor_np): if format == "CHW2": factor = 2 - if format == "CHW32": + elif format == "CHW32": factor = 32 + else: + raise ValueError( + "Unexpected format {} for testing reformat-free input".format(format) + ) shape = list(tensor_np.shape) + [factor] shape[-4] = div_up(shape[-4], factor) reformatted_tensor_np = np.empty(shape, tensor_np.dtype) if len(tensor_np.shape) == 3: batch = [(tensor_np, reformatted_tensor_np)] elif len(tensor_np.shape) == 4: - batch = [(tensor_np[idx], reformatted_tensor_np[idx]) - for idx in range(tensor_np.shape[0])] + batch = [ + (tensor_np[idx], reformatted_tensor_np[idx]) + for idx in range(tensor_np.shape[0]) + ] else: raise ValueError( "Unexpected numpy shape {} for testing reformat-free input".format( - tensor_np.shape)) - for (tensor, reformatted_tensor) in batch: + tensor_np.shape + ) + ) + for tensor, reformatted_tensor in batch: for c in range(tensor.shape[0]): for h in range(tensor.shape[1]): for w in range(tensor.shape[2]): - reformatted_tensor[c // - factor][h][w][c % - factor] = tensor[c][h][w] + reformatted_tensor[c // factor][h][w][c % factor] = tensor[c][h][w] return reformatted_tensor_np class TrtReformatFreeTest(tu.TestResultCollector): - def add_reformat_free_data_as_shared_memory(self, name, tensor, tensor_np): byte_size = tensor_np.size * tensor_np.dtype.itemsize - self.shm_handles.append( - shm.create_shared_memory_region(name, name, byte_size)) + self.shm_handles.append(shm.create_shared_memory_region(name, name, byte_size)) # Put data values into shared memory shm.set_shared_memory_region(self.shm_handles[-1], [tensor_np]) # Register 
shared memory with Triton Server @@ -88,7 +90,8 @@ def add_reformat_free_data_as_shared_memory(self, name, tensor, tensor_np): def setUp(self): self.shm_handles = [] self.triton_client = tritonhttpclient.InferenceServerClient( - "localhost:8000", verbose=True) + "localhost:8000", verbose=True + ) def tearDown(self): self.triton_client.unregister_system_shared_memory() @@ -106,39 +109,42 @@ def test_nobatch_chw2_input(self): # for non-linear format tensor, the data buffer is padded and thus the # data byte size may not match what is calculated from tensor shape inputs = [] - inputs.append(tritonhttpclient.InferInput('INPUT0', [13, 2, 1], "FP16")) - self.add_reformat_free_data_as_shared_memory("input0", inputs[-1], - reformatted_input_np) - inputs.append(tritonhttpclient.InferInput('INPUT1', [13, 2, 1], "FP16")) - self.add_reformat_free_data_as_shared_memory("input1", inputs[-1], - reformatted_input_np) + inputs.append(tritonhttpclient.InferInput("INPUT0", [13, 2, 1], "FP16")) + self.add_reformat_free_data_as_shared_memory( + "input0", inputs[-1], reformatted_input_np + ) + inputs.append(tritonhttpclient.InferInput("INPUT1", [13, 2, 1], "FP16")) + self.add_reformat_free_data_as_shared_memory( + "input1", inputs[-1], reformatted_input_np + ) outputs = [] outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT1', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True) + ) - results = self.triton_client.infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = self.triton_client.infer( + model_name=model_name, inputs=inputs, outputs=outputs + ) # Validate the results by comparing with precomputed values. 
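For readers unfamiliar with TensorRT reformat-free I/O: the reformat() helper earlier in this file packs channels into groups of "factor" (2 for CHW2, 32 for CHW32), padding the last group. A minimal standalone sketch of that packing, not part of the patch and using illustrative names only:

import numpy as np


def pack_chw(tensor, factor):
    # Pack a (C, H, W) tensor into the (ceil(C / factor), H, W, factor) layout
    # that the non-linear TensorRT formats expect; padding is zero-filled here.
    c, h, w = tensor.shape
    packed = np.zeros(((c + factor - 1) // factor, h, w, factor), tensor.dtype)
    for ch in range(c):
        packed[ch // factor, :, :, ch % factor] = tensor[ch]
    return packed


# 13 channels packed with factor 2 -> 7 groups (the last one is half padding),
# matching the [13, 2, 1] FP16 inputs used by the CHW2 tests.
x = np.arange(13 * 2 * 1, dtype=np.float16).reshape((13, 2, 1))
y = pack_chw(x, factor=2)
assert y.shape == (7, 2, 1, 2)
for ch in range(13):
    assert np.array_equal(y[ch // 2, :, :, ch % 2], x[ch])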
- output0_np = results.as_numpy('OUTPUT0') - output1_np = results.as_numpy('OUTPUT1') + output0_np = results.as_numpy("OUTPUT0") + output1_np = results.as_numpy("OUTPUT1") self.assertTrue( np.array_equal(output0_np, expected_output0_np), - "OUTPUT0 expected: {}, got {}".format(expected_output0_np, - output0_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output0_np, output0_np), + ) self.assertTrue( np.array_equal(output1_np, expected_output1_np), - "OUTPUT0 expected: {}, got {}".format(expected_output1_np, - output1_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output1_np, output1_np), + ) def test_chw2_input(self): model_name = "plan_CHW2_LINEAR_float16_float16_float16" for bs in [1, 8]: - input_np = np.arange(26 * bs, dtype=np.float16).reshape( - (bs, 13, 2, 1)) + input_np = np.arange(26 * bs, dtype=np.float16).reshape((bs, 13, 2, 1)) expected_output0_np = input_np + input_np expected_output1_np = input_np - input_np reformatted_input_np = reformat("CHW2", input_np) @@ -148,37 +154,37 @@ def test_chw2_input(self): # and thus the data byte size may not match what is calculated from # tensor shape inputs = [] - inputs.append( - tritonhttpclient.InferInput('INPUT0', [bs, 13, 2, 1], "FP16")) + inputs.append(tritonhttpclient.InferInput("INPUT0", [bs, 13, 2, 1], "FP16")) self.add_reformat_free_data_as_shared_memory( - "input0" + str(bs), inputs[-1], reformatted_input_np) - inputs.append( - tritonhttpclient.InferInput('INPUT1', [bs, 13, 2, 1], "FP16")) + "input0" + str(bs), inputs[-1], reformatted_input_np + ) + inputs.append(tritonhttpclient.InferInput("INPUT1", [bs, 13, 2, 1], "FP16")) self.add_reformat_free_data_as_shared_memory( - "input1" + str(bs), inputs[-1], reformatted_input_np) + "input1" + str(bs), inputs[-1], reformatted_input_np + ) outputs = [] outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT0', - binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT1', - binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True) + ) - results = self.triton_client.infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = self.triton_client.infer( + model_name=model_name, inputs=inputs, outputs=outputs + ) # Validate the results by comparing with precomputed values. 
- output0_np = results.as_numpy('OUTPUT0') - output1_np = results.as_numpy('OUTPUT1') + output0_np = results.as_numpy("OUTPUT0") + output1_np = results.as_numpy("OUTPUT1") self.assertTrue( np.array_equal(output0_np, expected_output0_np), - "OUTPUT0 expected: {}, got {}".format(expected_output0_np, - output0_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output0_np, output0_np), + ) self.assertTrue( np.array_equal(output1_np, expected_output1_np), - "OUTPUT0 expected: {}, got {}".format(expected_output1_np, - output1_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output1_np, output1_np), + ) def test_nobatch_chw32_input(self): model_name = "plan_nobatch_CHW32_LINEAR_float32_float32_float32" @@ -191,39 +197,42 @@ def test_nobatch_chw32_input(self): # for non-linear format tensor, the data buffer is padded and thus the # data byte size may not match what is calculated from tensor shape inputs = [] - inputs.append(tritonhttpclient.InferInput('INPUT0', [13, 2, 1], "FP32")) - self.add_reformat_free_data_as_shared_memory("input0", inputs[-1], - reformatted_input_np) - inputs.append(tritonhttpclient.InferInput('INPUT1', [13, 2, 1], "FP32")) - self.add_reformat_free_data_as_shared_memory("input1", inputs[-1], - reformatted_input_np) + inputs.append(tritonhttpclient.InferInput("INPUT0", [13, 2, 1], "FP32")) + self.add_reformat_free_data_as_shared_memory( + "input0", inputs[-1], reformatted_input_np + ) + inputs.append(tritonhttpclient.InferInput("INPUT1", [13, 2, 1], "FP32")) + self.add_reformat_free_data_as_shared_memory( + "input1", inputs[-1], reformatted_input_np + ) outputs = [] outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT1', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True) + ) - results = self.triton_client.infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = self.triton_client.infer( + model_name=model_name, inputs=inputs, outputs=outputs + ) # Validate the results by comparing with precomputed values. 
- output0_np = results.as_numpy('OUTPUT0') - output1_np = results.as_numpy('OUTPUT1') + output0_np = results.as_numpy("OUTPUT0") + output1_np = results.as_numpy("OUTPUT1") self.assertTrue( np.array_equal(output0_np, expected_output0_np), - "OUTPUT0 expected: {}, got {}".format(expected_output0_np, - output0_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output0_np, output0_np), + ) self.assertTrue( np.array_equal(output1_np, expected_output1_np), - "OUTPUT0 expected: {}, got {}".format(expected_output1_np, - output1_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output1_np, output1_np), + ) def test_chw32_input(self): model_name = "plan_CHW32_LINEAR_float32_float32_float32" for bs in [1, 8]: - input_np = np.arange(26 * bs, dtype=np.float32).reshape( - (bs, 13, 2, 1)) + input_np = np.arange(26 * bs, dtype=np.float32).reshape((bs, 13, 2, 1)) expected_output0_np = input_np + input_np expected_output1_np = input_np - input_np reformatted_input_np = reformat("CHW32", input_np) @@ -233,38 +242,38 @@ def test_chw32_input(self): # and thus the data byte size may not match what is calculated from # tensor shape inputs = [] - inputs.append( - tritonhttpclient.InferInput('INPUT0', [bs, 13, 2, 1], "FP32")) + inputs.append(tritonhttpclient.InferInput("INPUT0", [bs, 13, 2, 1], "FP32")) self.add_reformat_free_data_as_shared_memory( - "input0" + str(bs), inputs[-1], reformatted_input_np) - inputs.append( - tritonhttpclient.InferInput('INPUT1', [bs, 13, 2, 1], "FP32")) + "input0" + str(bs), inputs[-1], reformatted_input_np + ) + inputs.append(tritonhttpclient.InferInput("INPUT1", [bs, 13, 2, 1], "FP32")) self.add_reformat_free_data_as_shared_memory( - "input1" + str(bs), inputs[-1], reformatted_input_np) + "input1" + str(bs), inputs[-1], reformatted_input_np + ) outputs = [] outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT0', - binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT1', - binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True) + ) - results = self.triton_client.infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = self.triton_client.infer( + model_name=model_name, inputs=inputs, outputs=outputs + ) # Validate the results by comparing with precomputed values. 
- output0_np = results.as_numpy('OUTPUT0') - output1_np = results.as_numpy('OUTPUT1') + output0_np = results.as_numpy("OUTPUT0") + output1_np = results.as_numpy("OUTPUT1") self.assertTrue( np.array_equal(output0_np, expected_output0_np), - "OUTPUT0 expected: {}, got {}".format(expected_output0_np, - output0_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output0_np, output0_np), + ) self.assertTrue( np.array_equal(output1_np, expected_output1_np), - "OUTPUT0 expected: {}, got {}".format(expected_output1_np, - output1_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output1_np, output1_np), + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_trt_shape_tensors/test.sh b/qa/L0_trt_shape_tensors/test.sh old mode 100644 new mode 100755 index 9ca0bc958f..eed67d9dcb --- a/qa/L0_trt_shape_tensors/test.sh +++ b/qa/L0_trt_shape_tensors/test.sh @@ -49,7 +49,7 @@ SERVER_ARGS="--model-repository=`pwd`/models" SERVER_LOG="./inference_server.log" source ../common/util.sh -rm -fr *.serverlog *.log *.serverlog +rm -fr *.log rm -fr models && mkdir models cp -r /data/inferenceserver/${REPO_VERSION}/qa_shapetensor_model_repository/* models/. @@ -134,7 +134,7 @@ sed -i "s/^version_policy:.*/version_policy: { specific { versions: [1] }}/" $CO for i in \ test_dynamic_different_shape_values \ test_dynamic_identical_shape_values; do - SERVER_LOG="./$i.serverlog" + SERVER_LOG="./$i.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -169,7 +169,7 @@ for i in \ test_sequence_identical_shape_values ; do export TRITONSERVER_BACKLOG_DELAY_SCHEDULER=0 export TRITONSERVER_DELAY_SCHEDULER=12 - SERVER_LOG="./$i.serverlog" + SERVER_LOG="./$i.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -215,7 +215,7 @@ for i in \ test_dynaseq_different_shape_values_parallel \ ;do SERVER_ARGS="--model-repository=`pwd`/models" - SERVER_LOG="./$i.serverlog" + SERVER_LOG="./$i.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" diff --git a/qa/L0_trt_shape_tensors/trt_shape_tensor_test.py b/qa/L0_trt_shape_tensors/trt_shape_tensor_test.py old mode 100644 new mode 100755 index 89a3f889dc..a83795f981 --- a/qa/L0_trt_shape_tensors/trt_shape_tensor_test.py +++ b/qa/L0_trt_shape_tensors/trt_shape_tensor_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,24 +27,22 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
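Both the reformat-free tests above and the shape-tensor tests below can move input data through system shared memory rather than sending it inline. A hedged, self-contained sketch of that flow with the tritonclient utilities (region, model, and tensor names are placeholders, not taken from the patch):

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

client = httpclient.InferenceServerClient("localhost:8000")

data = np.arange(16, dtype=np.float32)
byte_size = data.size * data.dtype.itemsize

# Create a system shared-memory region, copy the data in, and register it
# with the server under the same name.
handle = shm.create_shared_memory_region("input0_data", "/input0_data", byte_size)
shm.set_shared_memory_region(handle, [data])
client.register_system_shared_memory("input0_data", "/input0_data", byte_size)

# Point the input at the region instead of attaching the bytes to the request.
inp = httpclient.InferInput("INPUT0", [16], "FP32")
inp.set_shared_memory("input0_data", byte_size)

# ... client.infer(...) with a real model would go here ...

# Clean up, mirroring the tearDown() logic in the tests.
client.unregister_system_shared_memory()
shm.destroy_shared_memory_region(handle)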
import sys + sys.path.append("../common") -from builtins import range -from future.utils import iteritems import os -import unittest -import time import threading -import traceback -import numpy as np +import time +import unittest +from builtins import range + import infer_util as iu -import test_util as tu +import numpy as np import sequence_util as su - +import test_util as tu import tritongrpcclient as grpcclient -TEST_SYSTEM_SHARED_MEMORY = bool( - int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0))) +TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0))) _model_instances = 1 _max_queue_delay_ms = 10000 @@ -53,7 +53,6 @@ class InferShapeTensorTest(tu.TestResultCollector): - def setUp(self): # The helper client for setup will be GRPC for simplicity. self.triton_client_ = grpcclient.InferenceServerClient("localhost:8001") @@ -76,14 +75,16 @@ def check_deferred_exception(self): if len(_deferred_exceptions) > 0: raise _deferred_exceptions[0] - def check_response(self, - bs, - thresholds, - shape_values, - dummy_input_shapes, - shm_region_names=None, - precreated_shm_regions=None, - shm_suffix=""): + def check_response( + self, + bs, + thresholds, + shape_values, + dummy_input_shapes, + shm_region_names=None, + precreated_shm_regions=None, + shm_suffix="", + ): try: # Add batch size to shape as full shape is expected for i in range(len(dummy_input_shapes)): @@ -94,7 +95,7 @@ def check_response(self, iu.infer_shape_tensor( self, - 'plan', + "plan", np.float32, shape_values, dummy_input_shapes, @@ -102,7 +103,8 @@ def check_response(self, use_streaming=False, shm_suffix=shm_suffix, use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, - batch_size=bs) + batch_size=bs, + ) end_ms = int(round(time.time() * 1000)) @@ -111,13 +113,21 @@ def check_response(self, if lt_ms is not None: self.assertTrue( (end_ms - start_ms) < lt_ms, - "expected less than " + str(lt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected less than " + + str(lt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) if gt_ms is not None: self.assertTrue( (end_ms - start_ms) > gt_ms, - "expected greater than " + str(gt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected greater than " + + str(gt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) except Exception as ex: self.add_deferred_exception(ex) @@ -127,109 +137,164 @@ def check_setup(self, model_name): bconfig = config.dynamic_batching self.assertTrue(2 in bconfig.preferred_batch_size) self.assertTrue(6 in bconfig.preferred_batch_size) - self.assertEqual(bconfig.max_queue_delay_microseconds, - _max_queue_delay_ms * 1000) # 10 secs + self.assertEqual( + bconfig.max_queue_delay_microseconds, _max_queue_delay_ms * 1000 + ) # 10 secs def check_status(self, model_name, batch_exec, exec_cnt, infer_cnt): - stats = self.triton_client_.get_inference_statistics(model_name, "1") - self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") - self.assertEqual(stats.model_stats[0].name, model_name, - "expect model stats for model {}".format(model_name)) + # There is a time window between when responses are returned and statistics are updated. + # To prevent intermittent test failure during that window, wait up to 10 seconds for the + # inference statistics to be ready. 
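The general shape of that wait-for-statistics pattern, sketched as a standalone helper before the retry loop that follows (client construction and model name are illustrative, not part of the patch):

import time

import tritonclient.grpc as grpcclient


def wait_for_exec_count(client, model_name, expected_exec_cnt, timeout_s=10):
    # Poll the per-model statistics until execution_count reaches the expected
    # value or the timeout expires, then hand the stats back to the caller so
    # its assertions can report any remaining mismatch.
    deadline = time.time() + timeout_s
    while True:
        stats = client.get_inference_statistics(model_name, "1")
        if (
            stats.model_stats
            and stats.model_stats[0].execution_count == expected_exec_cnt
        ):
            return stats
        if time.time() >= deadline:
            return stats
        time.sleep(1)


if __name__ == "__main__":
    # Assumes a server on localhost:8001 serving the named model.
    client = grpcclient.InferenceServerClient("localhost:8001")
    wait_for_exec_count(client, "plan_float32_float32_float32", 1)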
+ num_tries = 10 + for i in range(num_tries): + stats = self.triton_client_.get_inference_statistics(model_name, "1") + self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") + actual_exec_cnt = stats.model_stats[0].execution_count + if actual_exec_cnt == exec_cnt: + break + print( + "WARNING: expect {} executions, got {} (attempt {})".format( + exec_cnt, actual_exec_cnt, i + ) + ) + time.sleep(1) + self.assertEqual( - stats.model_stats[0].version, "1", - "expect model stats for model {} version 1".format(model_name)) + stats.model_stats[0].name, + model_name, + "expect model stats for model {}".format(model_name), + ) + self.assertEqual( + stats.model_stats[0].version, + "1", + "expect model stats for model {} version 1".format(model_name), + ) if batch_exec is not None: batch_stats = stats.model_stats[0].batch_stats print(batch_stats) self.assertEqual( - len(batch_stats), len(batch_exec), + len(batch_stats), + len(batch_exec), "expected {} different batch-sizes, got {}".format( - len(batch_exec), len(batch_stats))) + len(batch_exec), len(batch_stats) + ), + ) for batch_stat in batch_stats: bs = batch_stat.batch_size bc = batch_stat.compute_infer.count self.assertTrue( - bs in batch_exec, - "did not find expected batch-size {}".format(bs)) + bs in batch_exec, "did not find expected batch-size {}".format(bs) + ) # Get count from one of the stats self.assertEqual( - bc, batch_exec[bs], - "expected model-execution-count {} for batch size {}, got {}" - .format(batch_exec[bs], bs, bc)) + bc, + batch_exec[bs], + "expected model-execution-count {} for batch size {}, got {}".format( + batch_exec[bs], bs, bc + ), + ) actual_exec_cnt = stats.model_stats[0].execution_count self.assertEqual( - actual_exec_cnt, exec_cnt, - "expected model-exec-count {}, got {}".format( - exec_cnt, actual_exec_cnt)) + actual_exec_cnt, + exec_cnt, + "expected model-exec-count {}, got {}".format(exec_cnt, actual_exec_cnt), + ) actual_infer_cnt = stats.model_stats[0].inference_count self.assertEqual( - actual_infer_cnt, infer_cnt, + actual_infer_cnt, + infer_cnt, "expected model-inference-count {}, got {}".format( - infer_cnt, actual_infer_cnt)) + infer_cnt, actual_infer_cnt + ), + ) actual_infer_cnt = stats.model_stats[0].inference_count self.assertEqual( - actual_infer_cnt, infer_cnt, + actual_infer_cnt, + infer_cnt, "expected model-inference-count {}, got {}".format( - infer_cnt, actual_infer_cnt)) + infer_cnt, actual_infer_cnt + ), + ) def test_static_batch(self): iu.infer_shape_tensor( self, - 'plan', - np.float32, [[32, 32]], [[8, 4, 4]], + "plan", + np.float32, + [[32, 32]], + [[8, 4, 4]], use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, - batch_size=8) + batch_size=8, + ) iu.infer_shape_tensor( self, - 'plan', - np.float32, [[4, 4]], [[8, 32, 32]], + "plan", + np.float32, + [[4, 4]], + [[8, 32, 32]], use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, - batch_size=8) + batch_size=8, + ) iu.infer_shape_tensor( self, - 'plan', - np.float32, [[4, 4]], [[8, 4, 4]], + "plan", + np.float32, + [[4, 4]], + [[8, 4, 4]], use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, - batch_size=8) + batch_size=8, + ) def test_nobatch(self): iu.infer_shape_tensor( self, - 'plan_nobatch', - np.float32, [[32, 32]], [[4, 4]], - use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY) + "plan_nobatch", + np.float32, + [[32, 32]], + [[4, 4]], + use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, + ) iu.infer_shape_tensor( self, - 'plan_nobatch', - np.float32, [[4, 4]], [[32, 32]], - 
use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY) + "plan_nobatch", + np.float32, + [[4, 4]], + [[32, 32]], + use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, + ) iu.infer_shape_tensor( self, - 'plan_nobatch', - np.float32, [[4, 4]], [[4, 4]], - use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY) + "plan_nobatch", + np.float32, + [[4, 4]], + [[4, 4]], + use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, + ) def test_wrong_shape_values(self): over_shape_values = [[32, 33]] try: iu.infer_shape_tensor( self, - 'plan', + "plan", np.float32, - over_shape_values, [[8, 4, 4]], + over_shape_values, + [[8, 4, 4]], use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, - batch_size=8) + batch_size=8, + ) # InferenceServerException will be raised from different namespace, # use dynamic type characteristic to catch both ex except Exception as ex: self.assertTrue( "The shape value at index 2 is expected to be in range from 1 to 32, Got: 33" - in ex.message()) + in ex.message() + ) # Dynamic Batcher tests def test_dynamic_different_shape_values(self): @@ -245,22 +310,27 @@ def test_dynamic_different_shape_values(self): threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(3, (6000, None)), - kwargs={ - 'shape_values': [[2, 2]], - 'dummy_input_shapes': [[16, 16]], - 'shm_suffix': '{}'.format(len(threads)) - })) + threading.Thread( + target=self.check_response, + args=(3, (6000, None)), + kwargs={ + "shape_values": [[2, 2]], + "dummy_input_shapes": [[16, 16]], + "shm_suffix": "{}".format(len(threads)), + }, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(3, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), - kwargs={ - 'shape_values': [[4, 4]], - 'dummy_input_shapes': [[16, 16]], - 'shm_suffix': '{}'.format(len(threads)) - })) + threading.Thread( + target=self.check_response, + args=(3, (_max_queue_delay_ms * 1.5, _max_queue_delay_ms)), + kwargs={ + "shape_values": [[4, 4]], + "dummy_input_shapes": [[16, 16]], + "shm_suffix": "{}".format(len(threads)), + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -283,21 +353,27 @@ def test_dynamic_identical_shape_values(self): threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(4, (6000, None)), - kwargs={ - 'shape_values': [[4, 4]], - 'dummy_input_shapes': [[16, 16]], - 'shm_suffix': '{}'.format(len(threads)) - })) + threading.Thread( + target=self.check_response, + args=(4, (6000, None)), + kwargs={ + "shape_values": [[4, 4]], + "dummy_input_shapes": [[16, 16]], + "shm_suffix": "{}".format(len(threads)), + }, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, (6000, None)), - kwargs={ - 'shape_values': [[4, 4]], - 'dummy_input_shapes': [[16, 16]], - 'shm_suffix': '{}'.format(len(threads)) - })) + threading.Thread( + target=self.check_response, + args=(2, (6000, None)), + kwargs={ + "shape_values": [[4, 4]], + "dummy_input_shapes": [[16, 16]], + "shm_suffix": "{}".format(len(threads)), + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -310,7 +386,6 @@ def test_dynamic_identical_shape_values(self): class SequenceBatcherShapeTensorTest(su.SequenceBatcherTestUtil): - def get_expected_result(self, expected_result, value, flag_str=None): # Adjust the expected_result for models expected_result = value @@ -333,20 +408,21 @@ def test_sequence_identical_shape_values(self): # Need scheduler to wait for queue to contain all # inferences for both sequences. 
self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), - 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) + self.assertTrue("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0) precreated_shm0_handles = self.precreate_register_shape_tensor_regions( - ((2, 1), (4, 2), (8, 3)), dtype, 0) + ((2, 1), (4, 2), (8, 3)), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_shape_tensor_regions( - ((2, 11), (4, 12), (8, 13)), dtype, 1) + ((2, 11), (4, 12), (8, 13)), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_shape_tensor_regions( - ((2, 111), (4, 112), (8, 113)), dtype, 2) + ((2, 111), (4, 112), (8, 113)), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_shape_tensor_regions( - ((2, 1111), (4, 1112), (8, 1113)), dtype, 3) + ((2, 1111), (4, 1112), (8, 1113)), dtype, 3 + ) threads = [] threads.append( threading.Thread( @@ -357,12 +433,17 @@ def test_sequence_identical_shape_values(self): 1001, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 1, None), (None, 4, 2, None), ("end", 8, - 3, None)), + ( + ("start", 2, 1, None), + (None, 4, 2, None), + ("end", 8, 3, None), + ), self.get_expected_result(6, 3, "end"), - precreated_shm0_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -372,12 +453,17 @@ def test_sequence_identical_shape_values(self): 1002, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 11, None), (None, 4, 12, None), - ("end", 8, 13, None)), + ( + ("start", 2, 11, None), + (None, 4, 12, None), + ("end", 8, 13, None), + ), self.get_expected_result(36, 13, "end"), - precreated_shm1_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -387,12 +473,17 @@ def test_sequence_identical_shape_values(self): 1003, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 111, None), (None, 4, 112, None), - ("end", 8, 113, None)), + ( + ("start", 2, 111, None), + (None, 4, 112, None), + ("end", 8, 113, None), + ), self.get_expected_result(336, 113, "end"), - precreated_shm2_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -402,12 +493,17 @@ def test_sequence_identical_shape_values(self): 1004, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 1111, None), (None, 4, 1112, None), - ("end", 8, 1113, None)), + ( + ("start", 2, 1111, None), + (None, 4, 1112, None), + ("end", 8, 1113, None), + ), self.get_expected_result(3336, 1113, "end"), - precreated_shm3_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm3_handles, + ), + kwargs={"sequence_name": 
"{}".format(self._testMethodName)}, + ) + ) for t in threads: t.start() @@ -435,13 +531,17 @@ def test_sequence_different_shape_values(self): dtype = np.float32 precreated_shm0_handles = self.precreate_register_shape_tensor_regions( - ((1, 1), (1, 2), (1, 3)), dtype, 0) + ((1, 1), (1, 2), (1, 3)), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_shape_tensor_regions( - ((32, 11), (32, 12), (32, 13)), dtype, 1) + ((32, 11), (32, 12), (32, 13)), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_shape_tensor_regions( - ((16, 111), (16, 112), (16, 113)), dtype, 2) + ((16, 111), (16, 112), (16, 113)), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_shape_tensor_regions( - ((1, 1111), (1, 1112), (1, 1113)), dtype, 3) + ((1, 1111), (1, 1112), (1, 1113)), dtype, 3 + ) try: model_name = tu.get_sequence_model_name("plan", dtype) self.check_setup(model_name) @@ -449,12 +549,9 @@ def test_sequence_different_shape_values(self): # Need scheduler to wait for queue to contain all # inferences for both sequences. self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), - 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) + self.assertTrue("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0) threads = [] threads.append( @@ -466,12 +563,17 @@ def test_sequence_different_shape_values(self): 1001, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 1, 1, None), (None, 1, 2, None), ("end", 1, - 3, None)), + ( + ("start", 1, 1, None), + (None, 1, 2, None), + ("end", 1, 3, None), + ), self.get_expected_result(6, 3, "end"), - precreated_shm0_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -481,12 +583,17 @@ def test_sequence_different_shape_values(self): 1002, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 32, 11, None), (None, 32, 12, None), - ("end", 32, 13, None)), + ( + ("start", 32, 11, None), + (None, 32, 12, None), + ("end", 32, 13, None), + ), self.get_expected_result(36, 13, "end"), - precreated_shm1_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -496,12 +603,17 @@ def test_sequence_different_shape_values(self): 1003, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 16, 111, None), (None, 16, 112, None), - ("end", 16, 113, None)), + ( + ("start", 16, 111, None), + (None, 16, 112, None), + ("end", 16, 113, None), + ), self.get_expected_result(336, 113, "end"), - precreated_shm2_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -511,12 +623,17 @@ def test_sequence_different_shape_values(self): 1004, (None, None), # (flag_str, shape_value, 
value, pre_delay_ms) - (("start", 1, 1111, None), (None, 1, 1112, None), - ("end", 1, 1113, None)), + ( + ("start", 1, 1111, None), + (None, 1, 1112, None), + ("end", 1, 1113, None), + ), self.get_expected_result(3336, 1113, "end"), - precreated_shm3_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) for t in threads: t.start() @@ -537,12 +654,7 @@ def test_sequence_different_shape_values(self): class DynaSequenceBatcherTest(su.SequenceBatcherTestUtil): - - def get_expected_result(self, - expected_result, - corrid, - value, - flag_str=None): + def get_expected_result(self, expected_result, corrid, value, flag_str=None): expected_result = value if flag_str is not None: if "start" in flag_str: @@ -556,20 +668,23 @@ def _multi_sequence_different_shape_impl(self, sleep_secs): dtype = np.float32 precreated_shm0_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((1, 1), (12, 2), (2, 3)), dtype, 0) + ((1, 1), (12, 2), (2, 3)), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((3, 11), (4, 12), (5, 13)), dtype, 1) + ((3, 11), (4, 12), (5, 13)), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((6, 111), (7, 112), (8, 113)), dtype, 2) + ((6, 111), (7, 112), (8, 113)), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((9, 1111), (10, 1112), (11, 1113)), dtype, 3) + ((9, 1111), (10, 1112), (11, 1113)), dtype, 3 + ) try: model_name = tu.get_dyna_sequence_model_name("plan", dtype) self.check_setup(model_name) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) corrids = [1001, 1002, 1003, 1004] threads = [] @@ -582,17 +697,22 @@ def _multi_sequence_different_shape_impl(self, sleep_secs): corrids[0], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 1, 1, None), (None, 12, 2, None), ("end", 2, - 3, None)), - self.get_expected_result(4 + corrids[0], corrids[0], 3, - "end"), - precreated_shm0_handles), + ( + ("start", 1, 1, None), + (None, 12, 2, None), + ("end", 2, 3, None), + ), + self.get_expected_result(4 + corrids[0], corrids[0], 3, "end"), + precreated_shm0_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[0]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[0] + ), + "using_dynamic_batcher": True, + }, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -602,17 +722,24 @@ def _multi_sequence_different_shape_impl(self, sleep_secs): corrids[1], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 3, 11, None), (None, 4, 12, None), - ("end", 5, 13, None)), - self.get_expected_result(36 + corrids[1], corrids[1], - 13, "end"), - precreated_shm1_handles), + ( + ("start", 3, 11, None), + (None, 4, 12, None), + ("end", 5, 13, None), + ), + self.get_expected_result( + 36 + corrids[1], corrids[1], 13, "end" + ), + precreated_shm1_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[1]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[1] + ), + "using_dynamic_batcher": True, + }, + ) + ) threads.append( 
threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -622,17 +749,24 @@ def _multi_sequence_different_shape_impl(self, sleep_secs): corrids[2], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 6, 111, None), (None, 7, 112, None), - ("end", 8, 113, None)), - self.get_expected_result(336 + corrids[2], corrids[2], - 113, "end"), - precreated_shm2_handles), + ( + ("start", 6, 111, None), + (None, 7, 112, None), + ("end", 8, 113, None), + ), + self.get_expected_result( + 336 + corrids[2], corrids[2], 113, "end" + ), + precreated_shm2_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[2]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[2] + ), + "using_dynamic_batcher": True, + }, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -642,17 +776,24 @@ def _multi_sequence_different_shape_impl(self, sleep_secs): corrids[3], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 9, 1111, None), (None, 10, 1112, None), - ("end", 11, 1113, None)), - self.get_expected_result(3336 + corrids[3], corrids[3], - 1113, "end"), - precreated_shm3_handles), + ( + ("start", 9, 1111, None), + (None, 10, 1112, None), + ("end", 11, 1113, None), + ), + self.get_expected_result( + 3336 + corrids[3], corrids[3], 1113, "end" + ), + precreated_shm3_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[3]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[3] + ), + "using_dynamic_batcher": True, + }, + ) + ) for t in threads: t.start() @@ -676,21 +817,24 @@ def _multi_sequence_identical_shape_impl(self, sleep_secs): dtype = np.float32 precreated_shm0_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((2, 1), (4, 2), (8, 3)), dtype, 0) + ((2, 1), (4, 2), (8, 3)), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((2, 11), (4, 12), (8, 13)), dtype, 1) + ((2, 11), (4, 12), (8, 13)), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((2, 111), (4, 112), (8, 113)), dtype, 2) + ((2, 111), (4, 112), (8, 113)), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((2, 1111), (4, 1112), (8, 1113)), dtype, 3) + ((2, 1111), (4, 1112), (8, 1113)), dtype, 3 + ) try: model_name = tu.get_dyna_sequence_model_name("plan", dtype) self.check_setup(model_name) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) corrids = [1001, 1002, 1003, 1004] threads = [] @@ -703,17 +847,22 @@ def _multi_sequence_identical_shape_impl(self, sleep_secs): corrids[0], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 1, None), (None, 4, 2, None), ("end", 8, - 3, None)), - self.get_expected_result(4 + corrids[0], corrids[0], 3, - "end"), - precreated_shm0_handles), + ( + ("start", 2, 1, None), + (None, 4, 2, None), + ("end", 8, 3, None), + ), + self.get_expected_result(4 + corrids[0], corrids[0], 3, "end"), + precreated_shm0_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[0]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[0] + ), + 
"using_dynamic_batcher": True, + }, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -723,17 +872,24 @@ def _multi_sequence_identical_shape_impl(self, sleep_secs): corrids[1], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 11, None), (None, 4, 12, None), - ("end", 8, 13, None)), - self.get_expected_result(36 + corrids[1], corrids[1], - 13, "end"), - precreated_shm1_handles), + ( + ("start", 2, 11, None), + (None, 4, 12, None), + ("end", 8, 13, None), + ), + self.get_expected_result( + 36 + corrids[1], corrids[1], 13, "end" + ), + precreated_shm1_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[1]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[1] + ), + "using_dynamic_batcher": True, + }, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -743,17 +899,24 @@ def _multi_sequence_identical_shape_impl(self, sleep_secs): corrids[2], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 111, None), (None, 4, 112, None), - ("end", 8, 113, None)), - self.get_expected_result(336 + corrids[2], corrids[2], - 113, "end"), - precreated_shm2_handles), + ( + ("start", 2, 111, None), + (None, 4, 112, None), + ("end", 8, 113, None), + ), + self.get_expected_result( + 336 + corrids[2], corrids[2], 113, "end" + ), + precreated_shm2_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[2]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[2] + ), + "using_dynamic_batcher": True, + }, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -763,17 +926,24 @@ def _multi_sequence_identical_shape_impl(self, sleep_secs): corrids[3], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 1111, None), (None, 4, 1112, None), - ("end", 8, 1113, None)), - self.get_expected_result(3336 + corrids[3], corrids[3], - 1113, "end"), - precreated_shm3_handles), + ( + ("start", 2, 1111, None), + (None, 4, 1112, None), + ("end", 8, 1113, None), + ), + self.get_expected_result( + 3336 + corrids[3], corrids[3], 1113, "end" + ), + precreated_shm3_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[3]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[3] + ), + "using_dynamic_batcher": True, + }, + ) + ) for t in threads: t.start() @@ -815,5 +985,5 @@ def test_dynaseq_different_shape_values_parallel(self): self._multi_sequence_different_shape_impl(0) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_vertex_ai/test.sh b/qa/L0_vertex_ai/test.sh old mode 100644 new mode 100755 index d334d6c886..3113a66d1f --- a/qa/L0_vertex_ai/test.sh +++ b/qa/L0_vertex_ai/test.sh @@ -106,7 +106,7 @@ function vertex_ai_wait_for_server_ready() { WAIT_RET=1 } -# Helper function to unset all AIP vairables before test +# Helper function to unset all AIP variables before test function unset_vertex_variables() { unset AIP_MODE unset AIP_HTTP_PORT @@ -418,7 +418,7 @@ else fi fi -# Test AIP_STORAGE_URI won't be used if model repository is specified +# Test AIP_STORAGE_URI won't be used if model repository is specified SERVER_ARGS="--model-repository=single_model" run_server_nowait vertex_ai_wait_for_server_ready $SERVER_PID 10 diff --git 
a/qa/L0_vertex_ai/vertex_ai_test.py b/qa/L0_vertex_ai/vertex_ai_test.py old mode 100644 new mode 100755 index c421987538..b6f9fc42b4 --- a/qa/L0_vertex_ai/vertex_ai_test.py +++ b/qa/L0_vertex_ai/vertex_ai_test.py @@ -1,5 +1,5 @@ #!/usr/bin/python -# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,44 +26,34 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") import os -import shutil -import time +import sys import unittest + import numpy as np -import infer_util as iu +import requests import test_util as tu import tritonclient.http as httpclient -import argparse -import csv -import json -import os -import requests -import socket -import sys - class VertexAiTest(tu.TestResultCollector): - def setUp(self): - port = os.getenv('AIP_HTTP_PORT', '8080') - predict_endpoint = os.getenv('AIP_PREDICT_ROUTE', '/predict') - self.model_ = os.getenv('TEST_EXPLICIT_MODEL_NAME', 'addsub') + port = os.getenv("AIP_HTTP_PORT", "8080") + predict_endpoint = os.getenv("AIP_PREDICT_ROUTE", "/predict") + self.model_ = os.getenv("TEST_EXPLICIT_MODEL_NAME", "addsub") self.url_ = "http://localhost:{}{}".format(port, predict_endpoint) - self.input_data_ = [ - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 - ] + self.input_data_ = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] self.expected_output0_data_ = [x * 2 for x in self.input_data_] self.expected_output1_data_ = [0 for x in self.input_data_] def test_predict(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -71,22 +61,20 @@ def test_predict(self): inputs[0].set_data_from_numpy(input_data, binary_data=False) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) request_body, _ = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + inputs, outputs=outputs + ) - headers = {'Content-Type': 'application/json'} + headers = {"Content-Type": "application/json"} r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() - result = httpclient.InferenceServerClient.parse_response_body( - r._content) + result = httpclient.InferenceServerClient.parse_response_body(r._content) - output0_data = result.as_numpy('OUTPUT0') - output1_data = result.as_numpy('OUTPUT1') + output0_data = result.as_numpy("OUTPUT0") + output1_data = result.as_numpy("OUTPUT1") for i in range(16): self.assertEqual(output0_data[0][i], self.expected_output0_data_[i]) self.assertEqual(output1_data[0][i], self.expected_output1_data_[i]) @@ -94,8 +82,8 @@ def test_predict(self): def test_predict_specified_model(self): inputs = [] outputs = [] - 
inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -103,27 +91,23 @@ def test_predict_specified_model(self): inputs[0].set_data_from_numpy(input_data, binary_data=False) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) request_body, _ = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/json', - "X-Vertex-Ai-Triton-Redirect": - "v2/models/{}/infer".format(self.model_) + "Content-Type": "application/json", + "X-Vertex-Ai-Triton-Redirect": "v2/models/{}/infer".format(self.model_), } r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() - result = httpclient.InferenceServerClient.parse_response_body( - r._content) + result = httpclient.InferenceServerClient.parse_response_body(r._content) - output0_data = result.as_numpy('OUTPUT0') - output1_data = result.as_numpy('OUTPUT1') + output0_data = result.as_numpy("OUTPUT0") + output1_data = result.as_numpy("OUTPUT1") if self.model_ == "addsub": expected_output0_data = [x * 2 for x in self.input_data_] expected_output1_data = [0 for x in self.input_data_] @@ -137,8 +121,8 @@ def test_predict_specified_model(self): def test_predict_request_binary(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -146,25 +130,26 @@ def test_predict_request_binary(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.vertex-ai-triton.binary+json;json-header-size={}' - .format(header_length) + "Content-Type": "application/vnd.vertex-ai-triton.binary+json;json-header-size={}".format( + header_length + ) } r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() - result = httpclient.InferenceServerClient.parse_response_body( - r._content) - output0_data = result.as_numpy('OUTPUT0') - output1_data = result.as_numpy('OUTPUT1') + result = 
httpclient.InferenceServerClient.parse_response_body(r._content) + output0_data = result.as_numpy("OUTPUT0") + output1_data = result.as_numpy("OUTPUT1") for i in range(16): self.assertEqual(output0_data[0][i], self.expected_output0_data_[i]) self.assertEqual(output1_data[0][i], self.expected_output1_data_[i]) @@ -172,8 +157,8 @@ def test_predict_request_binary(self): def test_predict_response_binary(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -181,23 +166,23 @@ def test_predict_response_binary(self): inputs[0].set_data_from_numpy(input_data, binary_data=False) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) request_body, _ = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + inputs, outputs=outputs + ) - headers = {'Content-Type': 'application/json'} + headers = {"Content-Type": "application/json"} r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() - header_length_str = r.headers['Inference-Header-Content-Length'] + header_length_str = r.headers["Inference-Header-Content-Length"] result = httpclient.InferenceServerClient.parse_response_body( - r._content, header_length=int(header_length_str)) + r._content, header_length=int(header_length_str) + ) - output0_data = result.as_numpy('OUTPUT0') - output1_data = result.as_numpy('OUTPUT1') + output0_data = result.as_numpy("OUTPUT0") + output1_data = result.as_numpy("OUTPUT1") for i in range(16): self.assertEqual(output0_data[0][i], self.expected_output0_data_[i]) self.assertEqual(output1_data[0][i], self.expected_output1_data_[i]) @@ -205,8 +190,8 @@ def test_predict_response_binary(self): def test_malformed_binary_header(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -214,29 +199,34 @@ def test_malformed_binary_header(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 
'additional-string/application/vnd.vertex-ai-triton.binary+json;json-header-size={}' - .format(header_length) + "Content-Type": "additional-string/application/vnd.vertex-ai-triton.binary+json;json-header-size={}".format( + header_length + ) } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) def test_malformed_binary_header_not_number(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -244,29 +234,34 @@ def test_malformed_binary_header_not_number(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.vertex-ai-triton.binary+json;json-header-size=additional-string{}' - .format(header_length) + "Content-Type": "application/vnd.vertex-ai-triton.binary+json;json-header-size=additional-string{}".format( + header_length + ) } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) def test_malformed_binary_header_negative_number(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -274,28 +269,32 @@ def test_malformed_binary_header_negative_number(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.vertex-ai-triton.binary+json;json-header-size=-123' + "Content-Type": 
"application/vnd.vertex-ai-triton.binary+json;json-header-size=-123" } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) def test_malformed_binary_header_large_number(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -303,23 +302,27 @@ def test_malformed_binary_header_large_number(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.vertex-ai-triton.binary+json;json-header-size=12345' + "Content-Type": "application/vnd.vertex-ai-triton.binary+json;json-header-size=12345" } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_warmup/decoupled/1/model.py b/qa/L0_warmup/decoupled/1/model.py index db7c6903f5..9827a87f09 100644 --- a/qa/L0_warmup/decoupled/1/model.py +++ b/qa/L0_warmup/decoupled/1/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,11 +28,12 @@ class TritonPythonModel: - """Test model that always returns 0 response for all requests. """ + """Test model that always returns 0 response for all requests.""" def execute(self, requests): for request in requests: request.get_response_sender().send( - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) return None diff --git a/qa/L0_warmup/failing_infer/1/model.py b/qa/L0_warmup/failing_infer/1/model.py index 1935fe6cd9..632477c903 100644 --- a/qa/L0_warmup/failing_infer/1/model.py +++ b/qa/L0_warmup/failing_infer/1/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,7 +28,7 @@ class TritonPythonModel: - """Test model that always returns error for all requests. 
""" + """Test model that always returns error for all requests.""" def execute(self, requests): responses = [] @@ -36,8 +36,9 @@ def execute(self, requests): for _ in requests: responses.append( pb_utils.InferenceResponse( - output_tensors=[], - error=pb_utils.TritonError("An Error Occurred"))) + output_tensors=[], error=pb_utils.TritonError("An Error Occurred") + ) + ) # You must return a list of pb_utils.InferenceResponse. Length # of this list must match the length of `requests` list. diff --git a/qa/L0_warmup/test.sh b/qa/L0_warmup/test.sh old mode 100644 new mode 100755 index aad83e1789..193f4b130d --- a/qa/L0_warmup/test.sh +++ b/qa/L0_warmup/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -42,6 +42,9 @@ export CUDA_VISIBLE_DEVICES=0 CLIENT=../clients/image_client CLIENT_LOG="./client.log" +CLIENT_PY=./python_unittest.py +EXPECTED_NUM_TESTS="1" +TEST_RESULT_FILE='test_results.txt' IMAGE="../images/vulture.jpeg" @@ -56,6 +59,7 @@ SERVER_LOG="./inference_server.log" source ../common/util.sh RET=0 +rm -fr *.txt for BACKEND in ${BACKENDS}; do rm -f $SERVER_LOG $CLIENT_LOG @@ -408,8 +412,83 @@ set -e kill $SERVER_PID wait $SERVER_PID -if [ $RET -eq 0 ]; then - echo -e "\n***\n*** Test Passed\n***" +# Test the onnx model to verify that the memory type of the output tensor +# remains unchanged with the warmup setting +pip3 uninstall -y torch +pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html + +rm -fr models && mkdir models +cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/onnx_nobatch_float32_float32_float32 models/. +(cd models/onnx_nobatch_float32_float32_float32 && \ + echo "" >> config.pbtxt && \ + echo 'instance_group [{' >> config.pbtxt && \ + echo ' kind : KIND_GPU' >> config.pbtxt && \ + echo '}]' >> config.pbtxt && \ + echo 'model_warmup [{' >> config.pbtxt && \ + echo ' name : "sample"' >> config.pbtxt && \ + echo ' batch_size: 1' >> config.pbtxt && \ + echo ' inputs {' >> config.pbtxt && \ + echo ' key: "INPUT0"' >> config.pbtxt && \ + echo ' value: {' >> config.pbtxt && \ + echo ' data_type: TYPE_FP32' >> config.pbtxt && \ + echo " dims: 16" >> config.pbtxt && \ + echo " zero_data: false" >> config.pbtxt && \ + echo ' }' >> config.pbtxt && \ + echo ' }' >> config.pbtxt && \ + echo ' inputs {' >> config.pbtxt && \ + echo ' key: "INPUT1"' >> config.pbtxt && \ + echo ' value: {' >> config.pbtxt && \ + echo ' data_type: TYPE_FP32' >> config.pbtxt && \ + echo " dims: 16" >> config.pbtxt && \ + echo " zero_data: false" >> config.pbtxt && \ + echo ' }' >> config.pbtxt && \ + echo ' }' >> config.pbtxt && \ + echo '}]' >> config.pbtxt ) + +mkdir -p models/bls_onnx_warmup/1/ +cp ../python_models/bls_onnx_warmup/model.py models/bls_onnx_warmup/1/ +cp ../python_models/bls_onnx_warmup/config.pbtxt models/bls_onnx_warmup/. + +cp ../L0_backend_python/python_unittest.py . +sed -i 's#sys.path.append("../../common")#sys.path.append("../common")#g' python_unittest.py + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e + +export MODEL_NAME='bls_onnx_warmup' +python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** 'bls_onnx_warmup' test FAILED. 
\n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi + +set -e + + +kill $SERVER_PID +wait $SERVER_PID + + +if [ $RET -eq 1 ]; then + cat $CLIENT_LOG + cat $SERVER_LOG + echo -e "\n***\n*** Test Failed \n***" +else + echo -e "\n***\n*** Test Passed \n***" fi exit $RET diff --git a/qa/common/check_copyright.py b/qa/common/check_copyright.py index 7d6e8e0729..ff18ca8e39 100755 --- a/qa/common/check_copyright.py +++ b/qa/common/check_copyright.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 -# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,37 +28,68 @@ import argparse import os -import re import pathlib +import re FLAGS = None -SKIP_EXTS = ('jpeg', 'jpg', 'pgm', 'png', 'log', 'serverlog', 'preprocessed', - 'jmx', 'gz', 'json', 'pdf', 'so', 'onnx') -REPO_PATH_FROM_THIS_FILE = '../..' +SKIP_EXTS = ( + "jpeg", + "jpg", + "pgm", + "png", + "log", + "preprocessed", + "jmx", + "gz", + "json", + "pdf", + "so", + "onnx", + "svg", +) +REPO_PATH_FROM_THIS_FILE = "../.." SKIP_PATHS = ( - 'build', - 'deploy/gke-marketplace-app/.gitignore', - 'deploy/gke-marketplace-app/server-deployer/chart/.helmignore', - 'deploy/gcp/.helmignore', 'deploy/aws/.helmignore', - 'deploy/fleetcommand/.helmignore', 'docs/examples/model_repository', - 'docs/examples/jetson', 'docker', - 'qa/common/cuda_op_kernel.cu.cc.patch', - 'qa/ensemble_models/mix_platform_float32_float32_float32/output0_labels.txt', - 'qa/ensemble_models/mix_type_int32_float32_float32/output0_labels.txt', - 'qa/ensemble_models/mix_ensemble_int32_float32_float32/output0_labels.txt', - 'qa/ensemble_models/wrong_label_int32_float32_float32/output0_labels.txt', - 'qa/ensemble_models/label_override_int32_float32_float32/output0_labels.txt', - 'qa/L0_model_config/noautofill_platform', - 'qa/L0_model_config/autofill_noplatform', - 'qa/L0_model_config/autofill_noplatform_success', - 'qa/L0_model_config/special_cases', 'qa/L0_perf_nomodel/baseline', - 'qa/L0_perf_nomodel/legacy_baseline', 'qa/L0_warmup/raw_mug_data', - 'qa/L0_java_resnet/expected_output_data', - 'TRITON_VERSION') + "build", + "deploy/gke-marketplace-app/.gitignore", + "deploy/gke-marketplace-app/server-deployer/chart/.helmignore", + "deploy/gcp/.helmignore", + "deploy/aws/.helmignore", + "deploy/fleetcommand/.helmignore", + "docs/.gitignore", + "docs/_static/.gitattributes", + "docs/examples/model_repository", + "docs/examples/jetson", + "docker", + "qa/common/cuda_op_kernel.cu.cc.patch", + "qa/ensemble_models/mix_platform_float32_float32_float32/output0_labels.txt", + "qa/ensemble_models/mix_type_int32_float32_float32/output0_labels.txt", + "qa/ensemble_models/mix_ensemble_int32_float32_float32/output0_labels.txt", + "qa/ensemble_models/wrong_label_int32_float32_float32/output0_labels.txt", + "qa/ensemble_models/label_override_int32_float32_float32/output0_labels.txt", + "qa/L0_model_config/noautofill_platform", + "qa/L0_model_config/autofill_noplatform", + "qa/L0_model_config/autofill_noplatform_success", + "qa/L0_model_config/special_cases", + "qa/L0_model_config/cli_messages/cli_override/expected", + "qa/L0_model_config/cli_messages/cli_deprecation/expected", + 
"qa/L0_model_namespacing/test_duplication", + "qa/L0_model_namespacing/test_dynamic_resolution", + "qa/L0_model_namespacing/test_ensemble_duplication", + "qa/L0_model_namespacing/test_no_duplication", + "qa/L0_perf_nomodel/baseline", + "qa/L0_perf_nomodel/legacy_baseline", + "qa/L0_warmup/raw_mug_data", + "qa/L0_java_resnet/expected_output_data", + "qa/L0_trt_dla_jetson/trt_dla_model_store", + "qa/openvino_models/dynamic_batch", + "qa/openvino_models/fixed_batch", + "CITATION.cff", + "TRITON_VERSION", +) -COPYRIGHT_YEAR_RE = 'Copyright( \\(c\\))? 20[1-9][0-9](-(20)?[1-9][0-9])?(,((20[2-9][0-9](-(20)?[2-9][0-9])?)|([2-9][0-9](-[2-9][0-9])?)))*,? NVIDIA CORPORATION( & AFFILIATES)?. All rights reserved.' +COPYRIGHT_YEAR_RE = "Copyright( \\(c\\))? 20[1-9][0-9](-(20)?[1-9][0-9])?(,((20[2-9][0-9](-(20)?[2-9][0-9])?)|([2-9][0-9](-[2-9][0-9])?)))*,? NVIDIA CORPORATION( & AFFILIATES)?. All rights reserved." -COPYRIGHT = ''' +COPYRIGHT = """ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions @@ -83,10 +114,11 @@ OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -''' +""" -repo_abs_path = pathlib.Path(__file__).parent.joinpath( - REPO_PATH_FROM_THIS_FILE).resolve() +repo_abs_path = ( + pathlib.Path(__file__).parent.joinpath(REPO_PATH_FROM_THIS_FILE).resolve() +) copyright_year_re = re.compile(COPYRIGHT_YEAR_RE) @@ -96,32 +128,37 @@ def visit(path): print("visiting " + path) for skip in SKIP_EXTS: - if path.endswith('.' + skip): + if path.endswith("." + skip): if FLAGS.verbose: print("skipping due to extension: " + path) return True for skip in SKIP_PATHS: if str(pathlib.Path(path).resolve()).startswith( - str(repo_abs_path.joinpath(skip).resolve())): + str(repo_abs_path.joinpath(skip).resolve()) + ): if FLAGS.verbose: print("skipping due to path prefix: " + path) return True - with open(path, 'r') as f: + with open(path, "r") as f: first_line = True line = None try: for fline in f: line = fline - # Skip any '#!', '..', ' + +The models in this directory are TF2/keras models converted into OpenVINO +models. The "fixed_batch" model has a fixed batch dimension of 1 and the +"dynamic_batch" model has a variable batch dimension. + +The models are currently in **beta**, which they might not work as expected and +could be **changed, moved or deleted without warning** in the future. 
diff --git a/qa/openvino_models/dynamic_batch/1/model.bin b/qa/openvino_models/dynamic_batch/1/model.bin new file mode 100644 index 0000000000..e69de29bb2
diff --git a/qa/openvino_models/dynamic_batch/1/model.mapping b/qa/openvino_models/dynamic_batch/1/model.mapping new file mode 100644 index 0000000000..4705831777 --- /dev/null +++ b/qa/openvino_models/dynamic_batch/1/model.mapping @@ -0,0 +1,195 @@ [195 added lines of OpenVINO IR mapping XML; markup lost during text extraction]
diff --git a/qa/openvino_models/dynamic_batch/1/model.xml b/qa/openvino_models/dynamic_batch/1/model.xml new file mode 100644 index 0000000000..59594953c6 --- /dev/null +++ b/qa/openvino_models/dynamic_batch/1/model.xml @@ -0,0 +1,166 @@ [166 added lines of OpenVINO IR XML; markup lost during text extraction, only tensor dims (1, 4) remain]
diff --git a/qa/openvino_models/fixed_batch/1/model.bin b/qa/openvino_models/fixed_batch/1/model.bin new file mode 100644 index 0000000000..e69de29bb2
diff --git a/qa/openvino_models/fixed_batch/1/model.mapping b/qa/openvino_models/fixed_batch/1/model.mapping new file mode 100644 index 0000000000..bd1a4eccb8 --- /dev/null +++ b/qa/openvino_models/fixed_batch/1/model.mapping @@ -0,0 +1,211 @@ [211 added lines of OpenVINO IR mapping XML; markup lost during text extraction]
diff --git a/qa/openvino_models/fixed_batch/1/model.xml b/qa/openvino_models/fixed_batch/1/model.xml new file mode 100644 index 0000000000..e0f8954866 --- /dev/null +++ b/qa/openvino_models/fixed_batch/1/model.xml @@ -0,0 +1,152 @@ [152 added lines of OpenVINO IR XML; markup lost during text extraction, only tensor dims (1, 4) remain]
diff --git a/qa/python_models/add_sub/config.pbtxt b/qa/python_models/add_sub/config.pbtxt index b0805c0089..39bd6771d0 100644 --- a/qa/python_models/add_sub/config.pbtxt +++ b/qa/python_models/add_sub/config.pbtxt @@ -24,7 +24,6 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -name: "add_sub" backend: "python" input [
diff --git a/qa/python_models/add_sub/model.py b/qa/python_models/add_sub/model.py index 4aac895e1c..0868014804 100644 --- a/qa/python_models/add_sub/model.py +++ b/qa/python_models/add_sub/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. 
All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,29 +24,28 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import json + +import numpy as np import triton_python_backend_utils as pb_utils class TritonPythonModel: - def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) def execute(self, requests): - """ This function is called on inference request. - """ + """This function is called on inference request.""" output0_dtype = self.output0_dtype output1_dtype = self.output1_dtype @@ -55,18 +54,21 @@ def execute(self, requests): for request in requests: in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") - if in_0.as_numpy().dtype.type is np.bytes_ or in_0.as_numpy( - ).dtype == np.object_: - out_0, out_1 = (in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32),\ - in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32)) + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32), + ) else: - out_0, out_1 = (in_0.as_numpy() + in_1.as_numpy(), - in_0.as_numpy() - in_1.as_numpy()) + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) - responses.append( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) + responses.append(pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) return responses diff --git a/qa/python_models/add_sub_gpu/config.pbtxt b/qa/python_models/add_sub_gpu/config.pbtxt index 79154871c2..dd4a3ebecf 100644 --- a/qa/python_models/add_sub_gpu/config.pbtxt +++ b/qa/python_models/add_sub_gpu/config.pbtxt @@ -32,7 +32,7 @@ input [ name: "INPUT0" data_type: TYPE_FP32 dims: [ 4 ] - + } ] input [ @@ -40,7 +40,7 @@ input [ name: "INPUT1" data_type: TYPE_FP32 dims: [ 4 ] - + } ] output [ @@ -55,8 +55,8 @@ output [ name: "OUTPUT1" data_type: TYPE_FP32 dims: [ 4 ] - - + + } ] diff --git a/qa/python_models/auto_complete/model.py b/qa/python_models/auto_complete/model.py index c4768a562e..7f67182387 100644 --- a/qa/python_models/auto_complete/model.py +++ b/qa/python_models/auto_complete/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. 
All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,19 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import json + +import numpy as np import triton_python_backend_utils as pb_utils class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) @@ -47,21 +47,20 @@ def auto_complete_config(auto_complete_model_config): return auto_complete_model_config def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) def execute(self, requests): - """ This function is called on inference request. 
- """ + """This function is called on inference request.""" output0_dtype = self.output0_dtype output1_dtype = self.output1_dtype @@ -70,18 +69,21 @@ def execute(self, requests): for request in requests: in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") - if in_0.as_numpy().dtype.type is np.bytes_ or in_0.as_numpy( - ).dtype == np.object_: - out_0, out_1 = (in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32),\ - in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32)) + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32), + ) else: - out_0, out_1 = (in_0.as_numpy() + in_1.as_numpy(), - in_0.as_numpy() - in_1.as_numpy()) + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) - responses.append( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) + responses.append(pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) return responses diff --git a/qa/python_models/auto_complete_error/model.py b/qa/python_models/auto_complete_error/model.py index b45a8f1149..1d611c36d5 100644 --- a/qa/python_models/auto_complete_error/model.py +++ b/qa/python_models/auto_complete_error/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,13 +24,8 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): """ @@ -38,10 +33,10 @@ def auto_complete_config(auto_complete_model_config): to test correct handling of Python errors in the `auto_complete_config` function. """ - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/python_models/bls/model.py b/qa/python_models/bls/model.py index 894aa9a09a..30bba29a70 100644 --- a/qa/python_models/bls/model.py +++ b/qa/python_models/bls/model.py @@ -1,4 +1,4 @@ -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. 
All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,14 +24,17 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np +import gc +import os +import sys +import threading import unittest -import triton_python_backend_utils as pb_utils +from multiprocessing import Pool + +import numpy as np import torch +import triton_python_backend_utils as pb_utils from torch.utils.dlpack import from_dlpack, to_dlpack -import threading -from multiprocessing import Pool -import sys _deferred_exceptions_lock = threading.Lock() _deferred_exceptions = [] @@ -42,18 +45,19 @@ def bls_add_sub(_=None): input0_np = input0_np.astype(np.float32) input1_np = np.random.randn(*[16]) input1_np = input1_np.astype(np.float32) - input0 = pb_utils.Tensor('INPUT0', input0_np) - input1 = pb_utils.Tensor('INPUT1', input1_np) + input0 = pb_utils.Tensor("INPUT0", input0_np) + input1 = pb_utils.Tensor("INPUT1", input1_np) infer_request = pb_utils.InferenceRequest( - model_name='add_sub', + model_name="add_sub", inputs=[input0, input1], - requested_output_names=['OUTPUT0', 'OUTPUT1']) + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) infer_response = infer_request.exec() if infer_response.has_error(): return False - output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0') - output1 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT1') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1") if output0 is None or output1 is None: return False @@ -69,7 +73,97 @@ def bls_add_sub(_=None): return True +def bls_square(_=None): + input0_np = np.random.randint(16, size=1, dtype=np.int32) + input0 = pb_utils.Tensor("IN", input0_np) + infer_request = pb_utils.InferenceRequest( + model_name="square_int32", inputs=[input0], requested_output_names=["OUT"] + ) + infer_responses = infer_request.exec(decoupled=True) + + response_count = 0 + + if infer_responses: + for infer_response in infer_responses: + if infer_response.has_error(): + return False + + if len(infer_response.output_tensors()) > 0: + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + if output0 is None: + return False + + expected_output = input0.as_numpy() + + if not np.all(expected_output == output0.as_numpy()): + return False + + response_count += 1 + + if not np.all(input0.as_numpy() == response_count - 1): + return False + + return True + + +def bls_libtorch(model_name, result_device): + shape = [16] + input0_np = np.random.rand(*shape).astype(np.float32) + input1_np = np.random.rand(*shape).astype(np.float32) + input0 = pb_utils.Tensor("INPUT0", input0_np) + input1 = pb_utils.Tensor("INPUT1", input1_np) + + if result_device == "CPU": + preferred_memory = pb_utils.PreferredMemory(pb_utils.TRITONSERVER_MEMORY_CPU) + else: + preferred_memory = pb_utils.PreferredMemory(pb_utils.TRITONSERVER_MEMORY_GPU, 0) + + infer_request = pb_utils.InferenceRequest( + model_name=model_name, + model_version=1, + inputs=[input0, input1], + requested_output_names=["OUTPUT__0", "OUTPUT__1"], + preferred_memory=preferred_memory, + ) + + infer_response = infer_request.exec() + if infer_response.has_error(): + return False + + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT__0") + output1 = 
pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT__1") + if output0 is None or output1 is None: + return False + + expected_output_0 = input0.as_numpy() + input1.as_numpy() + expected_output_1 = input0.as_numpy() - input1.as_numpy() + + if result_device == "CPU": + if not output0.is_cpu() or not output1.is_cpu(): + return False + + if not np.all(expected_output_0 == output0.as_numpy()): + return False + + if not np.all(expected_output_1 == output1.as_numpy()): + return False + else: + if output0.is_cpu() or output1.is_cpu(): + return False + output0 = from_dlpack(output0.to_dlpack()).to("cpu").cpu().detach().numpy() + output1 = from_dlpack(output1.to_dlpack()).to("cpu").cpu().detach().numpy() + + if not np.all(output0 == expected_output_0): + return False + if not np.all(output1 == expected_output_1): + return False + + return True + + class PBBLSTest(unittest.TestCase): + def setUp(self): + self._is_decoupled = True if os.environ["BLS_KIND"] == "decoupled" else False def add_deferred_exception(self, ex): global _deferred_exceptions @@ -82,84 +176,132 @@ def check_deferred_exception(self): raise _deferred_exceptions[0] def test_bls_wrong_inputs(self): - input0 = pb_utils.Tensor('INPUT0', np.random.randn(*[1, 16])) + input0 = pb_utils.Tensor("INPUT0", np.random.randn(*[1, 16])) - infer_request = pb_utils.InferenceRequest( - model_name='add_sub', - inputs=[input0], - requested_output_names=['OUTPUT0', 'OUTPUT1']) - infer_response = infer_request.exec() - self.assertTrue(infer_response.has_error()) - self.assertEqual( - infer_response.error().message(), - "expected 2 inputs but got 1 inputs for model 'add_sub'") - self.assertTrue(len(infer_response.output_tensors()) == 0) + if self._is_decoupled: + infer_request = pb_utils.InferenceRequest( + model_name="square_int32", inputs=[], requested_output_names=["OUT"] + ) + infer_responses = infer_request.exec(decoupled=True) + for infer_response in infer_responses: + self.assertTrue(infer_response.has_error()) + self.assertIn( + "expected 1 inputs but got 0 inputs for model 'square_int32'", + infer_response.error().message(), + ) + self.assertTrue(len(infer_response.output_tensors()) == 0) + else: + infer_request = pb_utils.InferenceRequest( + model_name="add_sub", + inputs=[input0], + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) + infer_response = infer_request.exec() + self.assertTrue(infer_response.has_error()) + self.assertIn( + "expected 2 inputs but got 1 inputs for model 'add_sub'", + infer_response.error().message(), + ) + self.assertTrue(len(infer_response.output_tensors()) == 0) - def _send_bls_sequence_requests(self, correlation_id): + def _send_bls_sequence_requests(self, correlation_id, is_decoupled): # Start request try: - input = pb_utils.Tensor('INPUT', np.array([1000], dtype=np.int32)) + input = pb_utils.Tensor("INPUT", np.array([1000], dtype=np.int32)) infer_request = pb_utils.InferenceRequest( - model_name='onnx_nobatch_sequence_int32', + model_name="onnx_nobatch_sequence_int32", inputs=[input], - requested_output_names=['OUTPUT'], + requested_output_names=["OUTPUT"], flags=pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START, - correlation_id=correlation_id) - self.assertTrue(infer_request.flags(), - pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START) + correlation_id=correlation_id, + ) + self.assertTrue( + infer_request.flags(), pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START + ) infer_response = infer_request.exec() self.assertFalse(infer_response.has_error()) - output = 
pb_utils.get_output_tensor_by_name(infer_response, - 'OUTPUT') - self.assertEqual(output.as_numpy()[0], input.as_numpy()[0]) + output = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT") + self.assertFalse(output.is_cpu()) + output = from_dlpack(output.to_dlpack()).to("cpu").cpu().detach().numpy() + self.assertEqual(output[0], input.as_numpy()[0]) for i in range(10): - input = pb_utils.Tensor('INPUT', np.array([i], dtype=np.int32)) + input = pb_utils.Tensor("INPUT", np.array([i], dtype=np.int32)) infer_request = pb_utils.InferenceRequest( - model_name='onnx_nobatch_sequence_int32', + model_name="onnx_nobatch_sequence_int32", inputs=[input], - requested_output_names=['OUTPUT'], - correlation_id=correlation_id) - infer_response = infer_request.exec() + requested_output_names=["OUTPUT"], + correlation_id=correlation_id, + ) + + if is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() self.assertFalse(infer_response.has_error()) # The new output is the previous output + the current input - expected_output = output.as_numpy()[0] + i - output = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT') - self.assertEqual(output.as_numpy()[0], expected_output) + expected_output = output[0] + i + output = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT") + self.assertFalse(output.is_cpu()) + output = ( + from_dlpack(output.to_dlpack()).to("cpu").cpu().detach().numpy() + ) + self.assertEqual(output[0], expected_output) # Final request - input = pb_utils.Tensor('INPUT', np.array([2000], dtype=np.int32)) + input = pb_utils.Tensor("INPUT", np.array([2000], dtype=np.int32)) infer_request = pb_utils.InferenceRequest( - model_name='onnx_nobatch_sequence_int32', + model_name="onnx_nobatch_sequence_int32", inputs=[input], - requested_output_names=['OUTPUT'], - correlation_id=correlation_id) - infer_request.set_flags( - pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END) - self.assertTrue(infer_request.flags(), - pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END) + requested_output_names=["OUTPUT"], + correlation_id=correlation_id, + ) + infer_request.set_flags(pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END) + self.assertTrue( + infer_request.flags(), pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END + ) + + if is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() - infer_response = infer_request.exec() self.assertFalse(infer_response.has_error()) - expected_output = output.as_numpy()[0] + input.as_numpy()[0] - output = pb_utils.get_output_tensor_by_name(infer_response, - 'OUTPUT') - self.assertEqual(output.as_numpy()[0], expected_output) + expected_output = output[0] + input.as_numpy()[0] + output = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT") + self.assertFalse(output.is_cpu()) + output = from_dlpack(output.to_dlpack()).to("cpu").cpu().detach().numpy() + self.assertEqual(output[0], expected_output) except Exception as e: self.add_deferred_exception(e) def test_bls_sequence(self): # Send 2 sequence of BLS requests simultaneously and check the responses. 
threads = [] - thread1 = threading.Thread(target=self._send_bls_sequence_requests, - args=(1000,)) + thread1 = threading.Thread( + target=self._send_bls_sequence_requests, + args=( + 1000, + self._is_decoupled, + ), + ) threads.append(thread1) - thread2 = threading.Thread(target=self._send_bls_sequence_requests, - args=(1001,)) + thread2 = threading.Thread( + target=self._send_bls_sequence_requests, + args=( + 1001, + self._is_decoupled, + ), + ) threads.append(thread2) for thread in threads: @@ -174,30 +316,39 @@ def test_bls_sequence(self): def test_bls_incorrect_args(self): with self.assertRaises(TypeError): pb_utils.InferenceRequest( - inputs=[], requested_output_names=['OUTPUT0', 'OUTPUT1']) + inputs=[], requested_output_names=["OUTPUT0", "OUTPUT1"] + ) with self.assertRaises(TypeError): pb_utils.InferenceRequest( - model_name='add_sub', - requested_output_names=['OUTPUT0', 'OUTPUT1']) + model_name="add_sub", requested_output_names=["OUTPUT0", "OUTPUT1"] + ) with self.assertRaises(TypeError): - pb_utils.InferenceRequest(model_name='add_sub', inputs=[]) + pb_utils.InferenceRequest(model_name="add_sub", inputs=[]) - def _get_gpu_bls_outputs(self, input0_pb, input1_pb): + def _get_gpu_bls_outputs(self, input0_pb, input1_pb, is_decoupled): """ This function is created to test that the DLPack container works properly when the inference response and outputs go out of scope. """ infer_request = pb_utils.InferenceRequest( - model_name='dlpack_add_sub', + model_name="dlpack_add_sub", inputs=[input0_pb, input1_pb], - requested_output_names=['OUTPUT0', 'OUTPUT1']) - infer_response = infer_request.exec() + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) + if is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() + self.assertFalse(infer_response.has_error()) - output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0') - output1 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT1') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1") self.assertIsNotNone(output0) self.assertIsNotNone(output1) @@ -227,178 +378,435 @@ def _get_gpu_bls_outputs(self, input0_pb, input1_pb): output1_dlpack = None rc_after_del_dlpack_output0 = sys.getrefcount(output0) rc_after_del_dlpack_output1 = sys.getrefcount(output1) - self.assertEqual(rc_after_del_dlpack_output0 - rc_after_dlpack_output0, - -1) - self.assertEqual(rc_after_del_dlpack_output1 - rc_after_dlpack_output1, - -1) + self.assertEqual(rc_after_del_dlpack_output0 - rc_after_dlpack_output0, -1) + self.assertEqual(rc_after_del_dlpack_output1 - rc_after_dlpack_output1, -1) return output0.to_dlpack(), output1.to_dlpack() def test_zero_length_io(self): - model_name = 'identity_fp32' + model_name = "identity_fp32" input0 = np.zeros([1, 0], dtype=np.float32) - input0_pb = pb_utils.Tensor('INPUT0', input0) + input0_pb = pb_utils.Tensor("INPUT0", input0) infer_request = pb_utils.InferenceRequest( model_name=model_name, inputs=[input0_pb], - requested_output_names=['OUTPUT0']) - infer_response = infer_request.exec() + requested_output_names=["OUTPUT0"], + ) + + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = 
infer_request.exec() + self.assertFalse(infer_response.has_error()) - output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") self.assertTrue(np.all(output0 == input0)) - def test_bls_tensor_lifecycle(self): - model_name = 'dlpack_identity' + def cuda_memory_stats(self): + allocated_bytes = torch.cuda.memory_allocated() + reserved_bytes = torch.cuda.memory_reserved() + return allocated_bytes, reserved_bytes + + def bls_tensor_lifecycle_helper(self): + model_name = "dlpack_identity" + verbose = True # A 10 MB tensor. input_size = 10 * 1024 * 1024 + input_type_size_bytes = 4 # TYPE_FP32 + input_size_bytes = input_size * input_type_size_bytes # Sending the tensor 50 times to test whether the deallocation is # happening correctly. If the deallocation doesn't happen correctly, # there will be an out of shared memory error. for _ in range(50): input0 = np.ones([1, input_size], dtype=np.float32) - input0_pb = pb_utils.Tensor('INPUT0', input0) + input0_pb = pb_utils.Tensor("INPUT0", input0) infer_request = pb_utils.InferenceRequest( model_name=model_name, inputs=[input0_pb], - requested_output_names=['OUTPUT0']) - infer_response = infer_request.exec() + requested_output_names=["OUTPUT0"], + ) + + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() self.assertFalse(infer_response.has_error()) - output0 = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT0') - np.testing.assert_equal(output0.as_numpy(), input0, - "BLS CPU memory lifecycle failed.") + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + np.testing.assert_equal( + output0.as_numpy(), input0, "BLS CPU memory lifecycle failed." + ) + + # Show total memory stats before gpu tensor test + print(torch.cuda.memory_summary()) # Checking the same with the GPU tensors. for index in range(50): input0 = None infer_request = None input0_pb = None + fail_msg = f"GPU memory lifecycle test failed at index: {index}" torch.cuda.empty_cache() - free_memory, _ = torch.cuda.mem_get_info() - if index == 1: - recorded_memory = free_memory - - if index > 1: - self.assertEqual(free_memory, recorded_memory, - "GPU memory lifecycle test failed.") + alloced, cached = self.cuda_memory_stats() + + # Check cuda memory usage is cleaned up (empty) between iterations + # when device tensors go out of scope + self.assertEqual(alloced, 0, fail_msg) + # Check that cache is properly cleaned up when emptied + self.assertEqual(cached, 0, fail_msg) + + if verbose: + # NOTE: this reflects total gpu memory usage, and may be affected + # by other processes, so don't use it for direct checks but log it + # for debugging/context. 
+ free_memory, total_memory = torch.cuda.mem_get_info() + used_memory = total_memory - free_memory + print(f"[DEBUG][Iteration {index}][GPU] {used_memory=} bytes") + + input0 = torch.ones([1, input_size], dtype=torch.float32).to("cuda") + input0_pb = pb_utils.Tensor.from_dlpack("INPUT0", to_dlpack(input0)) + # Check cuda memory usage after creating device tensor + alloced, _ = self.cuda_memory_stats() + self.assertEqual( + alloced, + input_size_bytes, + "Expected precise byte allocation after input tensor creation", + ) - input0 = torch.ones([1, input_size], dtype=torch.float32).to('cuda') - input0_pb = pb_utils.Tensor.from_dlpack('INPUT0', to_dlpack(input0)) infer_request = pb_utils.InferenceRequest( model_name=model_name, inputs=[input0_pb], - requested_output_names=['OUTPUT0']) - infer_response = infer_request.exec() + requested_output_names=["OUTPUT0"], + ) + + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() + self.assertFalse(infer_response.has_error()) - output0 = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT0') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") output0_pytorch = from_dlpack(output0.to_dlpack()) + # Stats after getting output tensor + alloced, _ = self.cuda_memory_stats() + self.assertEqual( + alloced, + input_size_bytes, + "Expected only input allocation, as output zero-copies input tensor", + ) + # Set inference response and output0_pytorch to None, to make sure # that the DLPack is still valid. output0 = None infer_response = None self.assertTrue( torch.all(output0_pytorch == input0), - f"input ({input0}) and output ({output0_pytorch}) didn't match for identity model." 
+ f"input ({input0}) and output ({output0_pytorch}) didn't match for identity model.", ) - def _test_gpu_bls_add_sub(self, is_input0_gpu, is_input1_gpu): + print(torch.cuda.memory_summary()) + + def assert_cuda_memory_empty(self, msg): + torch.cuda.empty_cache() + alloced, cached = self.cuda_memory_stats() + self.assertEqual(alloced, 0, msg) + self.assertEqual(cached, 0, msg) + + def test_bls_tensor_lifecycle(self): + self.assert_cuda_memory_empty("Expected all gpu memory cleaned up before test") + self.bls_tensor_lifecycle_helper() + self.assert_cuda_memory_empty("Expected all gpu memory cleaned up after test") + + def _test_gpu_bls_add_sub(self, is_input0_gpu, is_input1_gpu, is_decoupled=False): input0 = torch.rand(16) input1 = torch.rand(16) if is_input0_gpu: - input0 = input0.to('cuda') + input0 = input0.to("cuda") if is_input1_gpu: - input1 = input1.to('cuda') + input1 = input1.to("cuda") + + input0_pb = pb_utils.Tensor.from_dlpack("INPUT0", to_dlpack(input0)) + input1_pb = pb_utils.Tensor.from_dlpack("INPUT1", to_dlpack(input1)) - input0_pb = pb_utils.Tensor.from_dlpack('INPUT0', to_dlpack(input0)) - input1_pb = pb_utils.Tensor.from_dlpack('INPUT1', to_dlpack(input1)) output0_dlpack, output1_dlpack = self._get_gpu_bls_outputs( - input0_pb, input1_pb) + input0_pb, input1_pb, is_decoupled=is_decoupled + ) - expected_output_0 = from_dlpack( - input0_pb.to_dlpack()).to('cpu') + from_dlpack( - input1_pb.to_dlpack()).to('cpu') - expected_output_1 = from_dlpack( - input0_pb.to_dlpack()).to('cpu') - from_dlpack( - input1_pb.to_dlpack()).to('cpu') + expected_output_0 = from_dlpack(input0_pb.to_dlpack()).to("cpu") + from_dlpack( + input1_pb.to_dlpack() + ).to("cpu") + expected_output_1 = from_dlpack(input0_pb.to_dlpack()).to("cpu") - from_dlpack( + input1_pb.to_dlpack() + ).to("cpu") self.assertTrue( - torch.all( - expected_output_0 == from_dlpack(output0_dlpack).to('cpu'))) + torch.all(expected_output_0 == from_dlpack(output0_dlpack).to("cpu")) + ) self.assertTrue( - torch.all( - expected_output_1 == from_dlpack(output1_dlpack).to('cpu'))) + torch.all(expected_output_1 == from_dlpack(output1_dlpack).to("cpu")) + ) def test_gpu_bls(self): for input0_device in [True, False]: for input1_device in [True, False]: - self._test_gpu_bls_add_sub(input0_device, input1_device) + self._test_gpu_bls_add_sub( + input0_device, input1_device, self._is_decoupled + ) def test_multiprocess(self): # Test multiprocess Pool with sync BLS - pool = Pool(10) - pool.map(bls_add_sub, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) - pool.close() - pool.join() + if self._is_decoupled: + # Fixme: DLIS-4630 + # func_name = bls_square + pass + else: + func_name = bls_add_sub + + pool = Pool(10) + pool.map(func_name, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) + pool.close() + pool.join() def test_bls_sync(self): infer_request = pb_utils.InferenceRequest( - model_name='non_existent_model', - inputs=[], - requested_output_names=[]) - infer_response = infer_request.exec() - - # Because the model doesn't exist, the inference response must have an - # error - self.assertTrue(infer_response.has_error()) - self.assertEqual( - infer_response.error().message(), - "Failed for execute the inference request. Model 'non_existent_model' is not ready." + model_name="non_existent_model", inputs=[], requested_output_names=[] ) - # Make sure that the inference requests can be performed properly after - # an error. 
- self.assertTrue(bls_add_sub()) + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + + for infer_response in infer_responses: + # Because the model doesn't exist, the inference response must have an + # error + self.assertTrue(infer_response.has_error()) + self.assertIn( + "Failed for execute the inference request. Model 'non_existent_model' is not ready.", + infer_response.error().message(), + ) + + # Make sure that the inference requests can be performed properly after + # an error. + self.assertTrue(bls_square()) + else: + infer_response = infer_request.exec() + + # Because the model doesn't exist, the inference response must have an + # error + self.assertTrue(infer_response.has_error()) + self.assertIn( + "Failed for execute the inference request. Model 'non_existent_model' is not ready.", + infer_response.error().message(), + ) + + # Make sure that the inference requests can be performed properly after + # an error. + self.assertTrue(bls_add_sub()) def test_bls_execute_error(self): # Test BLS with a model that has an error during execution. - infer_request = pb_utils.InferenceRequest(model_name='execute_error', - inputs=[], - requested_output_names=[]) - infer_response = infer_request.exec() + infer_request = pb_utils.InferenceRequest( + model_name="execute_error", inputs=[], requested_output_names=[] + ) + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() + self.assertTrue(infer_response.has_error()) - self.assertEqual( + self.assertIn( + "expected 1 inputs but got 0 inputs for model 'execute_error'", infer_response.error().message(), - "expected 1 inputs but got 0 inputs for model 'execute_error'") + ) self.assertTrue(len(infer_response.output_tensors()) == 0) def test_multiple_bls(self): # Test running multiple BLS requests together - for _ in range(100): - self.assertTrue(bls_add_sub()) + if self._is_decoupled: + for _ in range(100): + self.assertTrue(bls_square()) + else: + for _ in range(100): + self.assertTrue(bls_add_sub()) + def test_timeout(self): + tensor_size = [1, 1024 * 1024] + input0_np = np.random.randn(*tensor_size) + input0 = pb_utils.Tensor("INPUT0", input0_np.astype(np.float32)) + infer_request = pb_utils.InferenceRequest( + model_name="identity_fp32_timeout", + inputs=[input0], + requested_output_names=["OUTPUT0"], + timeout=5, + ) -class TritonPythonModel: + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + else: + infer_response = infer_request.exec() + # Expect timeout error + self.assertTrue(infer_response.has_error()) + self.assertIn("Request timeout expired", infer_response.error().message()) + self.assertTrue(len(infer_response.output_tensors()) == 0) + + # Verifies two things: + # 1. A request timeout can be accessed by receiver models + # 2. A user can specify a very large value (11s) for a timeout + infer_request = pb_utils.InferenceRequest( + model_name="identity_fp32_timeout", + inputs=[input0], + requested_output_names=["OUTPUT0"], + timeout=11000000000, + ) + + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + else: + infer_response = infer_request.exec() + + # Expect no timeout error. 
Check for log message + # in test.sh + self.assertFalse(infer_response.has_error()) + + def _test_response_iterator_square( + self, expected_output_cnt, expected_output_value, response_iterator + ): + response_count = 0 + expected_output_cnt = np.array([expected_output_cnt], dtype=np.int32) + + for infer_response in response_iterator: + self.assertFalse(infer_response.has_error()) + if len(infer_response.output_tensors()) > 0: + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + self.assertIsNotNone(output0) + self.assertEqual(expected_output_value, output0.as_numpy()) + + response_count += 1 + + self.assertEqual(response_count, expected_output_cnt) + + # Make sure the iterator is exhausted. + with self.assertRaises(StopIteration): + next(response_iterator) + + return response_iterator + + def test_response_iterator(self): + if self._is_decoupled: + # Test the response iterator for decoupled responses. The request + # has 4 decoupled responses followed by an empty response. + response_value = 4 + input0_np = np.array([response_value], dtype=np.int32) + input0 = pb_utils.Tensor("IN", input0_np) + infer_request = pb_utils.InferenceRequest( + model_name="square_int32", + inputs=[input0], + requested_output_names=["OUT"], + ) + infer_responses = infer_request.exec(decoupled=True) + + # case 1. Use Next() to get the next response first, then use + # for-loop to get the remaining responses. + infer_response = next(infer_responses) + self.assertFalse(infer_response.has_error()) + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + self.assertIsNotNone(output0) + self.assertEqual(response_value, output0.as_numpy()) + # The iterator now should only have 4 remaining responses. + infer_responses = self._test_response_iterator_square( + 4, response_value, infer_responses + ) + + # case 2. Call for-loop to get all the responses multiple times. + infer_responses = self._test_response_iterator_square( + 5, response_value, infer_responses + ) + infer_responses = self._test_response_iterator_square( + 5, response_value, infer_responses + ) + infer_responses = self._test_response_iterator_square( + 5, response_value, infer_responses + ) + + # case 3. Break from the iteration, then use Next() and for-loop to + # get the remaining responses. + response_count = 0 + for infer_response in infer_responses: + self.assertFalse(infer_response.has_error()) + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + self.assertIsNotNone(output0) + self.assertEqual(response_value, output0.as_numpy()) + + response_count += 1 + if response_count == 2: + break + + infer_response = next(infer_responses) + self.assertFalse(infer_response.has_error()) + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + self.assertIsNotNone(output0) + self.assertEqual(response_value, output0.as_numpy()) + + # The iterator now should only have 2 remaining responses. + infer_responses = self._test_response_iterator_square( + 2, response_value, infer_responses + ) + + # case 4. Delete the iterator before all the responses have been + # retrieved. 
+ infer_responses = infer_request.exec(decoupled=True) + + infer_response = next(infer_responses) + self.assertFalse(infer_response.has_error()) + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + self.assertIsNotNone(output0) + self.assertEqual(response_value, output0.as_numpy()) + + del infer_responses + + def test_preferred_memory(self): + self.assertTrue(bls_libtorch("libtorch_gpu", "CPU")) + self.assertTrue(bls_libtorch("libtorch_cpu", "GPU")) + + +class TritonPythonModel: def execute(self, requests): responses = [] for _ in requests: # Run the unittest and store the results in InferenceResponse. - test = unittest.main('model', exit=False) + test = unittest.main("model", exit=False) + for test_case, traceback in test.result.failures: + print(f"{test_case} failed:\n{traceback}") responses.append( - pb_utils.InferenceResponse([ - pb_utils.Tensor( - 'OUTPUT0', - np.array([test.result.wasSuccessful()], - dtype=np.float16)) - ])) + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) return responses diff --git a/qa/python_models/bls_async/model.py b/qa/python_models/bls_async/model.py index 676b7727de..8d75259b7b 100644 --- a/qa/python_models/bls_async/model.py +++ b/qa/python_models/bls_async/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,44 +24,43 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+import asyncio +import os + import numpy as np -import triton_python_backend_utils as pb_utils import torch +import triton_python_backend_utils as pb_utils from torch.utils.dlpack import from_dlpack, to_dlpack -import asyncio def verify_add_sub_results(input0, input1, infer_response): if infer_response.has_error(): + print("Async BLS failed:", infer_response.error().message(), flush=True) return False - output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0') - output1 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT1') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1") if (output0 is None) or (output1 is None): return False if not input0.is_cpu(): - input0 = from_dlpack( - input0.to_dlpack()).to('cpu').cpu().detach().numpy() + input0 = from_dlpack(input0.to_dlpack()).to("cpu").cpu().detach().numpy() else: input0 = input0.as_numpy() if not input1.is_cpu(): - input1 = from_dlpack( - input1.to_dlpack()).to('cpu').cpu().detach().numpy() + input1 = from_dlpack(input1.to_dlpack()).to("cpu").cpu().detach().numpy() else: input1 = input1.as_numpy() if not output0.is_cpu(): - output0 = from_dlpack( - output0.to_dlpack()).to('cpu').cpu().detach().numpy() + output0 = from_dlpack(output0.to_dlpack()).to("cpu").cpu().detach().numpy() else: output0 = output0.as_numpy() if not output1.is_cpu(): - output1 = from_dlpack( - output1.to_dlpack()).to('cpu').cpu().detach().numpy() + output1 = from_dlpack(output1.to_dlpack()).to("cpu").cpu().detach().numpy() else: output1 = output1.as_numpy() @@ -69,11 +68,56 @@ def verify_add_sub_results(input0, input1, infer_response): expected_output_1 = input0 - input1 if not np.all(expected_output_0 == output0): - print(f'For OUTPUT0 expected {expected_output_0} found {output0}') + print(f"For OUTPUT0 expected {expected_output_0} found {output0}") return False if not np.all(expected_output_1 == output1): - print(f'For OUTPUT1 expected {expected_output_1} found {output1}') + print(f"For OUTPUT1 expected {expected_output_1} found {output1}") + return False + + return True + + +def verify_square_results(input0, infer_responses): + if not input0.is_cpu(): + input0 = from_dlpack(input0.to_dlpack()).to("cpu").cpu().detach().numpy() + else: + input0 = input0.as_numpy() + + response_count = 0 + + for infer_response in infer_responses: + if infer_response.has_error(): + print( + "Async BLS decoupled failed:", + infer_response.error().message(), + flush=True, + ) + return False + + if len(infer_response.output_tensors()) > 0: + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + + if output0 is None: + return False + + if not output0.is_cpu(): + output0 = ( + from_dlpack(output0.to_dlpack()).to("cpu").cpu().detach().numpy() + ) + else: + output0 = output0.as_numpy() + + expected_output = input0 + + if not np.all(expected_output == input0): + print(f"For OUT expected {expected_output} found {output0}") + return False + + response_count += 1 + + if not np.all(input0 == response_count - 1): + print("Expected {} responses, got {}".format(input0, response_count - 1)) return False return True @@ -85,23 +129,36 @@ def create_addsub_inference_request(gpu=False): input1_np = np.random.randn(16) input0_np = input0_np.astype(np.float32) input1_np = input1_np.astype(np.float32) - input0 = pb_utils.Tensor('INPUT0', input0_np) - input1 = pb_utils.Tensor('INPUT1', input1_np) + input0 = pb_utils.Tensor("INPUT0", input0_np) + input1 = pb_utils.Tensor("INPUT1", 
input1_np) else: - input0_pytorch = torch.rand(16).to('cuda') - input1_pytorch = torch.rand(16).to('cuda') - input0 = pb_utils.Tensor.from_dlpack('INPUT0', - to_dlpack(input0_pytorch)) - input1 = pb_utils.Tensor.from_dlpack('INPUT1', - to_dlpack(input1_pytorch)) + input0_pytorch = torch.rand(16).to("cuda") + input1_pytorch = torch.rand(16).to("cuda") + input0 = pb_utils.Tensor.from_dlpack("INPUT0", to_dlpack(input0_pytorch)) + input1 = pb_utils.Tensor.from_dlpack("INPUT1", to_dlpack(input1_pytorch)) infer_request = pb_utils.InferenceRequest( - model_name='dlpack_add_sub', + model_name="dlpack_add_sub", inputs=[input0, input1], - requested_output_names=['OUTPUT0', 'OUTPUT1']) + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) return input0, input1, infer_request +def create_square_inference_request(gpu=False): + if not gpu: + input0_np = np.random.randint(16, size=1, dtype=np.int32) + input0 = pb_utils.Tensor("IN", input0_np) + else: + input0_pytorch = torch.randint(1, 16, (1,), dtype=torch.int32).to("cuda") + input0 = pb_utils.Tensor.from_dlpack("IN", to_dlpack(input0_pytorch)) + + infer_request = pb_utils.InferenceRequest( + model_name="dlpack_square", inputs=[input0], requested_output_names=["OUT"] + ) + return input0, infer_request + + async def async_bls_add_sub(): input0, input1, infer_request = create_addsub_inference_request() infer_response = await infer_request.async_exec() @@ -117,7 +174,22 @@ async def async_bls_add_sub(): return True -async def multiple_async_bls(gpu): +async def async_bls_square(): + input0, infer_request = create_square_inference_request() + infer_responses = await infer_request.async_exec(decoupled=True) + result_correct = verify_square_results(input0, infer_responses) + if not result_correct: + return False + + infer_responses_sync = infer_request.exec(decoupled=True) + result_correct = verify_square_results(input0, infer_responses_sync) + if not result_correct: + return False + + return True + + +async def multiple_async_bls_addsub(gpu): infer_request_aws = [] inputs = [] for _ in range(10): @@ -127,14 +199,26 @@ async def multiple_async_bls(gpu): infer_responses = await asyncio.gather(*infer_request_aws) for infer_response, input_pair in zip(infer_responses, inputs): - if infer_response.has_error(): - print('Async BLS failed:', - infer_response.error().message(), - flush=True) + result_correct = verify_add_sub_results( + input_pair[0], input_pair[1], infer_response + ) + if not result_correct: return False - result_correct = verify_add_sub_results(input_pair[0], input_pair[1], - infer_response) + return True + + +async def multiple_async_bls_square(gpu): + infer_request_aws = [] + inputs = [] + for _ in range(10): + input0, infer_request = create_square_inference_request(gpu) + inputs.append(input0) + infer_request_aws.append(infer_request.async_exec(decoupled=True)) + + async_responses = await asyncio.gather(*infer_request_aws) + for infer_responses, input_pair in zip(async_responses, inputs): + result_correct = verify_square_results(input_pair, infer_responses) if not result_correct: return False @@ -142,18 +226,26 @@ async def multiple_async_bls(gpu): class TritonPythonModel: - async def execute(self, requests): + is_decoupled = True if os.environ["BLS_KIND"] == "decoupled" else False + responses = [] for _ in requests: - test1 = await multiple_async_bls(gpu=True) - test2 = await multiple_async_bls(gpu=False) - test3 = await async_bls_add_sub() + if is_decoupled: + test1 = await multiple_async_bls_square(gpu=True) + test2 = await 
multiple_async_bls_square(gpu=False) + test3 = await async_bls_square() + else: + test1 = await multiple_async_bls_addsub(gpu=True) + test2 = await multiple_async_bls_addsub(gpu=False) + test3 = await async_bls_add_sub() responses.append( - pb_utils.InferenceResponse(output_tensors=[ - pb_utils.Tensor('OUTPUT0', np.array([test1 & test2 & - test3])) - ])) + pb_utils.InferenceResponse( + output_tensors=[ + pb_utils.Tensor("OUTPUT0", np.array([test1 & test2 & test3])) + ] + ) + ) return responses diff --git a/qa/python_models/bls_finalize_error/config.pbtxt b/qa/python_models/bls_finalize_error/config.pbtxt new file mode 100644 index 0000000000..ff5f42188b --- /dev/null +++ b/qa/python_models/bls_finalize_error/config.pbtxt @@ -0,0 +1,38 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "bls_finalize_error" +backend: "python" + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/bls_finalize_error/model.py b/qa/python_models/bls_finalize_error/model.py new file mode 100644 index 0000000000..a38b1080ad --- /dev/null +++ b/qa/python_models/bls_finalize_error/model.py @@ -0,0 +1,45 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + pass + + def execute(self, requests): + pass + + def finalize(self): + print("Cleaning up...") + input0_np = np.random.randint(3, size=1, dtype=np.int32) + input0 = pb_utils.Tensor("IN", input0_np) + infer_request = pb_utils.InferenceRequest( + model_name="square_int32", inputs=[input0], requested_output_names=["OUT"] + ) + infer_responses = infer_request.exec(decoupled=True) diff --git a/qa/python_models/bls_init_error/config.pbtxt b/qa/python_models/bls_init_error/config.pbtxt new file mode 100644 index 0000000000..6cf5024e1f --- /dev/null +++ b/qa/python_models/bls_init_error/config.pbtxt @@ -0,0 +1,38 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "bls_init_error" +backend: "python" + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/bls_init_error/model.py b/qa/python_models/bls_init_error/model.py new file mode 100644 index 0000000000..b2518e0334 --- /dev/null +++ b/qa/python_models/bls_init_error/model.py @@ -0,0 +1,44 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + input0_np = np.random.randint(3, size=1, dtype=np.int32) + input0 = pb_utils.Tensor("IN", input0_np) + infer_request = pb_utils.InferenceRequest( + model_name="square_int32", inputs=[input0], requested_output_names=["OUT"] + ) + infer_responses = infer_request.exec(decoupled=True) + + def execute(self, requests): + pass + + def finalize(self): + print("Cleaning up...") diff --git a/qa/python_models/bls_memory/model.py b/qa/python_models/bls_memory/model.py index 101c321ec8..69da4f440f 100644 --- a/qa/python_models/bls_memory/model.py +++ b/qa/python_models/bls_memory/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,64 +24,80 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
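Note: the bls_init_error and bls_finalize_error models above deliberately issue a decoupled BLS call from initialize()/finalize() so the test can exercise that failure path; BLS requests are ordinarily made from execute(). For orientation, a minimal sketch of that ordinary pattern, reusing the square_int32 model name from the tests above (illustrative only, not part of the patch):

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for _ in requests:
            input0 = pb_utils.Tensor("IN", np.array([3], dtype=np.int32))
            infer_request = pb_utils.InferenceRequest(
                model_name="square_int32",
                inputs=[input0],
                requested_output_names=["OUT"],
            )
            # exec(decoupled=True) yields an iterator of responses; responses
            # that carry no output tensors are skipped, mirroring the
            # output_tensors() check used in the tests above.
            outputs = []
            for infer_response in infer_request.exec(decoupled=True):
                if infer_response.has_error():
                    raise pb_utils.TritonModelException(
                        infer_response.error().message()
                    )
                if len(infer_response.output_tensors()) > 0:
                    out = pb_utils.get_output_tensor_by_name(infer_response, "OUT")
                    outputs.append(out.as_numpy()[0])
            responses.append(
                pb_utils.InferenceResponse(
                    [pb_utils.Tensor("OUTPUT0", np.array(outputs, dtype=np.int32))]
                )
            )
        return responses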
-import numpy as np +import os import unittest + +import numpy as np import triton_python_backend_utils as pb_utils class PBBLSMemoryTest(unittest.TestCase): + def setUp(self): + self._is_decoupled = True if os.environ["BLS_KIND"] == "decoupled" else False - def _send_identity_tensor(self, size): + def _send_identity_tensor(self, size, is_decoupled): tensor_size = [1, size] input0_np = np.random.randn(*tensor_size) - input0 = pb_utils.Tensor('INPUT0', input0_np.astype(np.float32)) + input0 = pb_utils.Tensor("INPUT0", input0_np.astype(np.float32)) infer_request = pb_utils.InferenceRequest( - model_name='identity_fp32', + model_name="identity_fp32", inputs=[input0], - requested_output_names=['OUTPUT0']) - return input0_np, infer_request.exec() + requested_output_names=["OUTPUT0"], + ) + + if is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() + + return input0_np, infer_response def test_bls_out_of_memory(self): - tensor_size = 1024 * 1024 * 1024 - input0_np, infer_response = self._send_identity_tensor(tensor_size) + tensor_size = 256 * 1024 * 1024 + input0_np, infer_response = self._send_identity_tensor( + tensor_size, self._is_decoupled + ) out_of_memory_message = "Failed to increase the shared memory pool size for key" if infer_response.has_error(): - self.assertIn(out_of_memory_message, - infer_response.error().message()) + self.assertIn(out_of_memory_message, infer_response.error().message()) else: self.assertFalse(infer_response.has_error()) - output0 = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT0') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") self.assertIsNotNone(output0) self.assertTrue(np.allclose(output0.as_numpy(), input0_np)) tensor_size = 50 * 1024 * 1024 for _ in range(4): - input0_np, infer_response = self._send_identity_tensor(tensor_size) + input0_np, infer_response = self._send_identity_tensor( + tensor_size, self._is_decoupled + ) if infer_response.has_error(): - self.assertIn(out_of_memory_message, - infer_response.error().message()) + self.assertIn(out_of_memory_message, infer_response.error().message()) else: self.assertFalse(infer_response.has_error()) - output0 = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT0') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") self.assertIsNotNone(output0) self.assertTrue(np.allclose(output0.as_numpy(), input0_np)) class TritonPythonModel: - def execute(self, requests): responses = [] for _ in requests: # Run the unittest and store the results in InferenceResponse. - test = unittest.main('model', exit=False) + test = unittest.main("model", exit=False) responses.append( - pb_utils.InferenceResponse([ - pb_utils.Tensor( - 'OUTPUT0', - np.array([test.result.wasSuccessful()], - dtype=np.float16)) - ])) + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) return responses diff --git a/qa/python_models/bls_memory_async/model.py b/qa/python_models/bls_memory_async/model.py index c7eec807b1..d9e676b42e 100644 --- a/qa/python_models/bls_memory_async/model.py +++ b/qa/python_models/bls_memory_async/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,31 +24,42 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +import os + import numpy as np import triton_python_backend_utils as pb_utils -async def _send_identity_tensor(size): +async def _send_identity_tensor(size, is_decoupled): tensor_size = [1, size] input0_np = np.random.randn(*tensor_size) - input0 = pb_utils.Tensor('INPUT0', input0_np.astype(np.float32)) + input0 = pb_utils.Tensor("INPUT0", input0_np.astype(np.float32)) infer_request = pb_utils.InferenceRequest( - model_name='identity_fp32', - inputs=[input0], - requested_output_names=['OUTPUT0']) - return input0_np, await infer_request.async_exec() + model_name="identity_fp32", inputs=[input0], requested_output_names=["OUTPUT0"] + ) + + if is_decoupled: + infer_responses = await infer_request.async_exec(decoupled=True) + infer_response = next(infer_responses) + else: + infer_response = await infer_request.async_exec() + + return input0_np, infer_response async def test_bls_out_of_memory(): - tensor_size = 1024 * 1024 * 1024 - input0_np, infer_response = await _send_identity_tensor(tensor_size) + is_decoupled = True if os.environ["BLS_KIND"] == "decoupled" else False + + tensor_size = 256 * 1024 * 1024 + input0_np, infer_response = await _send_identity_tensor(tensor_size, is_decoupled) + out_of_memory_message = "Failed to increase the shared memory pool size for key" if infer_response.has_error(): if not (out_of_memory_message in infer_response.error().message()): return False else: - output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") if output0 is None: return False if not np.allclose(output0.as_numpy(), input0_np): @@ -56,13 +67,15 @@ async def test_bls_out_of_memory(): tensor_size = 50 * 1024 * 1024 for _ in range(4): - input0_np, infer_response = await _send_identity_tensor(tensor_size) + input0_np, infer_response = await _send_identity_tensor( + tensor_size, is_decoupled + ) + if infer_response.has_error(): if not (out_of_memory_message in infer_response.error().message()): return False else: - output0 = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT0') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") if output0 is None: return False if not np.allclose(output0.as_numpy(), input0_np): @@ -72,15 +85,14 @@ async def test_bls_out_of_memory(): class TritonPythonModel: - async def execute(self, requests): responses = [] for _ in requests: # Run the unittest and store the results in InferenceResponse. result = await test_bls_out_of_memory() responses.append( - pb_utils.InferenceResponse([ - pb_utils.Tensor('OUTPUT0', - np.array([result], dtype=np.float16)) - ])) + pb_utils.InferenceResponse( + [pb_utils.Tensor("OUTPUT0", np.array([result], dtype=np.float16))] + ) + ) return responses diff --git a/qa/python_models/bls_model_loading/config.pbtxt b/qa/python_models/bls_model_loading/config.pbtxt new file mode 100644 index 0000000000..2099ba5db7 --- /dev/null +++ b/qa/python_models/bls_model_loading/config.pbtxt @@ -0,0 +1,43 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "bls_model_loading" +backend: "python" + +output [ + { + name: "OUTPUT0" + data_type: TYPE_BOOL + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/qa/python_models/bls_model_loading/model.py b/qa/python_models/bls_model_loading/model.py new file mode 100644 index 0000000000..84162e2fac --- /dev/null +++ b/qa/python_models/bls_model_loading/model.py @@ -0,0 +1,135 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
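The bls_model_loading test that follows drives the in-process model control API. As a quick orientation, a condensed sketch of the load/unload round-trip it relies on (assuming, as in this suite, that the server runs with explicit model control):

import triton_python_backend_utils as pb_utils

model_name = "onnx_int32_int32_int32"

# Load and verify readiness.
pb_utils.load_model(model_name=model_name)
assert pb_utils.is_model_ready(model_name)

# Reload with an inline config override that pins version 2 only.
config = '{"backend":"onnxruntime", "version_policy":{"specific":{"versions":[2]}}}'
pb_utils.load_model(model_name, config=config)
assert pb_utils.is_model_ready(model_name, "2")
assert not pb_utils.is_model_ready(model_name, "3")

# unload_model() returns before the unload completes (see the tearDown below).
pb_utils.unload_model(model_name)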
+ +import time +import unittest + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class PBBLSModelLoadingTest(unittest.TestCase): + def setUp(self): + self.model_name = "onnx_int32_int32_int32" + + def tearDown(self): + # The unload call does not wait for the requested model to be fully + # unloaded before returning. + pb_utils.unload_model(self.model_name) + # TODO: Make this more robust to wait until fully unloaded + print("Sleep 30 seconds to make sure model finishes unloading...") + time.sleep(30) + print("Done sleeping.") + + def test_load_unload_model(self): + self.assertFalse(pb_utils.is_model_ready(model_name=self.model_name)) + pb_utils.load_model(model_name=self.model_name) + self.assertTrue(pb_utils.is_model_ready(self.model_name)) + pb_utils.unload_model(self.model_name) + self.assertFalse(pb_utils.is_model_ready(self.model_name)) + + def test_load_with_config_override(self): + self.assertFalse(pb_utils.is_model_ready(self.model_name)) + pb_utils.load_model(self.model_name) + self.assertTrue(pb_utils.is_model_ready(self.model_name)) + + # Send the config with the wrong format + wrong_config = '"parameters": {"config": {{"backend":"onnxruntime", "version_policy":{"specific":{"versions":[2]}}}}}' + with self.assertRaises(pb_utils.TritonModelException): + pb_utils.load_model(model_name=self.model_name, config=wrong_config) + # The model should not be changed after a failed load model request + for version in ["2", "3"]: + self.assertTrue( + pb_utils.is_model_ready( + model_name=self.model_name, model_version=version + ) + ) + + # Send the config with the correct format + config = ( + '{"backend":"onnxruntime", "version_policy":{"specific":{"versions":[2]}}}' + ) + pb_utils.load_model(self.model_name, config=config) + # The model should be changed after a successful load model request + self.assertTrue(pb_utils.is_model_ready(self.model_name, "2")) + self.assertFalse(pb_utils.is_model_ready(self.model_name, "3")) + + def test_load_with_file_override(self): + self.assertFalse(pb_utils.is_model_ready(self.model_name)) + pb_utils.load_model(self.model_name) + self.assertTrue(pb_utils.is_model_ready(self.model_name)) + + override_name = "override_model" + config = '{"backend":"onnxruntime"}' + with open("models/onnx_int32_int32_int32/3/model.onnx", "rb") as file: + data = file.read() + files = {"file:1/model.onnx": data} + + # Request to load the model with override file, should fail without + # providing override config. 
+ with self.assertRaises(pb_utils.TritonModelException): + pb_utils.load_model(self.model_name, "", files) + + # Request to load the model with override file and config in a different name + pb_utils.load_model(model_name=override_name, config=config, files=files) + # Sanity check that the model with original name is unchanged + self.assertFalse(pb_utils.is_model_ready(self.model_name, "1")) + self.assertTrue(pb_utils.is_model_ready(self.model_name, "3")) + + # Check the override model readiness + self.assertTrue(pb_utils.is_model_ready(override_name, "1")) + self.assertFalse(pb_utils.is_model_ready(override_name, "3")) + + # Request to load the model with override file and config in original name + pb_utils.load_model(self.model_name, config, files) + # Check that the model with original name is changed + self.assertTrue(pb_utils.is_model_ready(self.model_name, "1")) + self.assertFalse(pb_utils.is_model_ready(self.model_name, "3")) + + # Sanity check readiness of the different named model + self.assertTrue(pb_utils.is_model_ready(override_name, "1")) + self.assertFalse(pb_utils.is_model_ready(override_name, "3")) + + +class TritonPythonModel: + def initialize(self, args): + # Run the unittest during initialization + test = unittest.main("model", exit=False) + self.result = test.result.wasSuccessful() + + def execute(self, requests): + responses = [] + for _ in requests: + responses.append( + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", np.array([self.result], dtype=np.float16) + ) + ] + ) + ) + return responses diff --git a/qa/python_models/bls_onnx_warmup/config.pbtxt b/qa/python_models/bls_onnx_warmup/config.pbtxt new file mode 100644 index 0000000000..879f85ca81 --- /dev/null +++ b/qa/python_models/bls_onnx_warmup/config.pbtxt @@ -0,0 +1,38 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
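Several of the new QA models in this patch share the same harness: the real assertions live in a unittest.TestCase inside model.py, and the model simply runs the suite and reports success through a single output tensor. A condensed sketch of that harness (the class name ModelTest is illustrative):

import unittest

import numpy as np
import triton_python_backend_utils as pb_utils


class ModelTest(unittest.TestCase):
    def test_something(self):
        self.assertTrue(True)  # real checks go here


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for _ in requests:
            # "model" refers to this model.py module, as in the tests in this patch.
            test = unittest.main("model", exit=False)
            responses.append(
                pb_utils.InferenceResponse(
                    [
                        pb_utils.Tensor(
                            "OUTPUT0",
                            np.array([test.result.wasSuccessful()], dtype=np.float16),
                        )
                    ]
                )
            )
        return responses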
+ +name: "bls_onnx_warmup" +backend: "python" + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +instance_group [{ kind: KIND_CPU }] \ No newline at end of file diff --git a/qa/python_models/bls_onnx_warmup/model.py b/qa/python_models/bls_onnx_warmup/model.py new file mode 100644 index 0000000000..233bdc85ab --- /dev/null +++ b/qa/python_models/bls_onnx_warmup/model.py @@ -0,0 +1,88 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +import unittest + +import numpy as np +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack + + +class PBBLSONNXWarmupTest(unittest.TestCase): + def test_onnx_output_mem_type(self): + input0_np = np.random.randn(*[16]) + input0_np = input0_np.astype(np.float32) + input1_np = np.random.randn(*[16]) + input1_np = input1_np.astype(np.float32) + input0 = pb_utils.Tensor("INPUT0", input0_np) + input1 = pb_utils.Tensor("INPUT1", input1_np) + infer_request = pb_utils.InferenceRequest( + model_name="onnx_nobatch_float32_float32_float32", + inputs=[input0, input1], + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) + + infer_response = infer_request.exec() + + self.assertFalse(infer_response.has_error()) + + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1") + + self.assertIsNotNone(output0) + self.assertIsNotNone(output1) + + # The memory type of output tensor should be GPU + self.assertFalse(output0.is_cpu()) + self.assertFalse(output1.is_cpu()) + + expected_output_0 = input0.as_numpy() - input1.as_numpy() + expected_output_1 = input0.as_numpy() + input1.as_numpy() + + output0 = from_dlpack(output0.to_dlpack()).to("cpu").cpu().detach().numpy() + output1 = from_dlpack(output1.to_dlpack()).to("cpu").cpu().detach().numpy() + + self.assertTrue(np.all(output0 == expected_output_0)) + self.assertTrue(np.all(output1 == expected_output_1)) + + +class TritonPythonModel: + def execute(self, requests): + responses = [] + for _ in requests: + # Run the unittest and store the results in InferenceResponse. + test = unittest.main("model", exit=False) + responses.append( + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) + return responses diff --git a/qa/python_models/bls_parameters/config.pbtxt b/qa/python_models/bls_parameters/config.pbtxt new file mode 100644 index 0000000000..dddf300185 --- /dev/null +++ b/qa/python_models/bls_parameters/config.pbtxt @@ -0,0 +1,52 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "bls_parameters" +backend: "python" +max_batch_size: 0 + +input [ + { + name: "NUMBER_PARAMETERS" + data_type: TYPE_UINT8 + dims: [ 1 ] + } +] + +output [ + { + name: "PARAMETERS_AGGREGATED" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +instance_group [ + { + count: 4 + kind: KIND_CPU + } +] diff --git a/qa/python_models/bls_parameters/model.py b/qa/python_models/bls_parameters/model.py new file mode 100644 index 0000000000..5dc54ebffd --- /dev/null +++ b/qa/python_models/bls_parameters/model.py @@ -0,0 +1,77 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
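The bls_parameters model below forwards custom request parameters through BLS: the incoming parameters are read as a JSON string via request.parameters(), extended, and handed to the nested pb_utils.InferenceRequest through its parameters argument. A minimal sketch of just that hand-off (the extra "str_1" entry is illustrative):

import json

import numpy as np
import triton_python_backend_utils as pb_utils


def forward_with_extra_parameter(request):
    # request.parameters() returns the request parameters as a JSON string.
    params = json.loads(request.parameters())
    params["str_1"] = "1"  # illustrative extra entry

    count = pb_utils.Tensor("NUMBER_PARAMETERS", np.array([0], dtype=np.ubyte))
    bls_request = pb_utils.InferenceRequest(
        model_name="bls_parameters",
        inputs=[count],
        requested_output_names=["PARAMETERS_AGGREGATED"],
        parameters=params,
    )
    return bls_request.exec()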
+ +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def execute(self, requests): + responses = [] + + for request in requests: + num_params = int( + pb_utils.get_input_tensor_by_name( + request, "NUMBER_PARAMETERS" + ).as_numpy()[0] + ) + params = json.loads(request.parameters()) + + if num_params == 0: + # Base case where the received parameters are returned as JSON + response = json.dumps(params) + response_tensors = [ + pb_utils.Tensor( + "PARAMETERS_AGGREGATED", np.array([response], dtype=np.object_) + ) + ] + else: + # Add the parameters of num_params step to the received parameters + params["bool_" + str(num_params)] = bool(num_params) + params["int_" + str(num_params)] = num_params + params["str_" + str(num_params)] = str(num_params) + # Complete any remaining steps [1, num_params - 1] by calling self + # recursively via BLS + bls_request_tensor = pb_utils.Tensor( + "NUMBER_PARAMETERS", np.array([num_params - 1], dtype=np.ubyte) + ) + bls_request = pb_utils.InferenceRequest( + model_name="bls_parameters", + inputs=[bls_request_tensor], + requested_output_names=["PARAMETERS_AGGREGATED"], + parameters=params, + ) + bls_response = bls_request.exec() + response_tensors = bls_response.output_tensors() + + inference_response = pb_utils.InferenceResponse( + output_tensors=response_tensors + ) + responses.append(inference_response) + + return responses diff --git a/qa/python_models/bls_request_rescheduling/config.pbtxt b/qa/python_models/bls_request_rescheduling/config.pbtxt new file mode 100644 index 0000000000..84f8658f7f --- /dev/null +++ b/qa/python_models/bls_request_rescheduling/config.pbtxt @@ -0,0 +1,38 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
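Tracing the recursion in bls_parameters above: each step adds its own bool_/int_/str_ entries before recursing with NUMBER_PARAMETERS reduced by one, and the base case serializes the accumulated dict. For NUMBER_PARAMETERS=2 and no client-supplied parameters, PARAMETERS_AGGREGATED is therefore expected to decode to (key order aside):

expected = {
    "bool_2": True, "int_2": 2, "str_2": "2",
    "bool_1": True, "int_1": 1, "str_1": "1",
}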
+ +name: "bls_request_rescheduling" +backend: "python" + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/bls_request_rescheduling/model.py b/qa/python_models/bls_request_rescheduling/model.py new file mode 100644 index 0000000000..8615622af9 --- /dev/null +++ b/qa/python_models/bls_request_rescheduling/model.py @@ -0,0 +1,133 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +import time +import unittest + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class RequestReschedulingTest(unittest.TestCase): + def _reload_model(self, model_name): + # Reload the model to reset the flag for multiple iterations + pb_utils.unload_model(model_name) + # TODO: Make this more robust to wait until fully unloaded + print("Sleep 10 seconds to make sure model finishes unloading...", flush=True) + time.sleep(10) + print("Done sleeping.", flush=True) + pb_utils.load_model(model_name) + + def test_wrong_return_type(self): + input0 = pb_utils.Tensor("INPUT0", (np.random.randn(*[4])).astype(np.float32)) + infer_request = pb_utils.InferenceRequest( + model_name="wrong_return_type", + inputs=[input0], + requested_output_names=["OUTPUT0"], + ) + + infer_response = infer_request.exec() + self.assertTrue(infer_response.has_error()) + self.assertIn( + "Expected a None object in the execute function return list for reschduled request", + infer_response.error().message(), + ) + + def test_non_decoupled_e2e(self): + model_name = "request_rescheduling_addsub" + self._reload_model(model_name) + + input0_np = np.random.randn(*[16]) + input0_np = input0_np.astype(np.float32) + input1_np = np.random.randn(*[16]) + input1_np = input1_np.astype(np.float32) + input0 = pb_utils.Tensor("INPUT0", input0_np) + input1 = pb_utils.Tensor("INPUT1", input1_np) + infer_request = pb_utils.InferenceRequest( + model_name=model_name, + inputs=[input0, input1], + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) + infer_response = infer_request.exec() + + self.assertFalse(infer_response.has_error()) + + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1") + + self.assertIsNotNone(output0) + self.assertIsNotNone(output1) + + expected_output_0 = input0.as_numpy() + input1.as_numpy() + expected_output_1 = input0.as_numpy() - input1.as_numpy() + + self.assertEqual(expected_output_0[0], output0.as_numpy()[0]) + self.assertEqual(expected_output_1[0], output1.as_numpy()[0]) + + def test_decoupled_e2e(self): + model_name = "iterative_sequence" + self._reload_model(model_name) + + input_value = 3 + input0 = pb_utils.Tensor("IN", np.array([input_value], dtype=np.int32)) + infer_request = pb_utils.InferenceRequest( + model_name=model_name, + inputs=[input0], + requested_output_names=["OUT"], + ) + infer_responses = infer_request.exec(decoupled=True) + + expected_output = input_value - 1 + + if infer_responses: + for infer_response in infer_responses: + self.assertFalse(infer_response.has_error()) + + if len(infer_response.output_tensors()) > 0: + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + self.assertIsNotNone(output0) + + self.assertEqual(expected_output, output0.as_numpy()[0]) + expected_output -= 1 + + +class TritonPythonModel: + def execute(self, requests): + responses = [] + for _ in requests: + # Run the unittest and store the results in InferenceResponse. + test = unittest.main("model", exit=False) + responses.append( + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) + return responses diff --git a/qa/python_models/bls_simple/bls_simple.py b/qa/python_models/bls_simple/bls_simple.py new file mode 100644 index 0000000000..962c3834b9 --- /dev/null +++ b/qa/python_models/bls_simple/bls_simple.py @@ -0,0 +1,84 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + inputs = [ + {"name": "MODEL_NAME", "data_type": "TYPE_STRING", "dims": [1]}, + {"name": "INPUT0", "data_type": "TYPE_INT32", "dims": [1, 16]}, + {"name": "INPUT1", "data_type": "TYPE_INT32", "dims": [1, 16]}, + ] + outputs = [ + {"name": "OUTPUT0", "data_type": "TYPE_INT32", "dims": [16]}, + {"name": "OUTPUT1", "data_type": "TYPE_INT32", "dims": [16]}, + ] + + config = auto_complete_model_config.as_dict() + input_names = [] + output_names = [] + for input in config["input"]: + input_names.append(input["name"]) + for output in config["output"]: + output_names.append(output["name"]) + + for input in inputs: + if input["name"] not in input_names: + auto_complete_model_config.add_input(input) + for output in outputs: + if output["name"] not in output_names: + auto_complete_model_config.add_output(output) + + auto_complete_model_config.set_max_batch_size(0) + + return auto_complete_model_config + + def execute(self, requests): + responses = [] + for request in requests: + in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") + in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") + model_name = pb_utils.get_input_tensor_by_name(request, "MODEL_NAME") + model_name_string = model_name.as_numpy()[0] + + infer_request = pb_utils.InferenceRequest( + model_name=model_name_string, + requested_output_names=["OUTPUT0", "OUTPUT1"], + inputs=[in_0, in_1], + trace=request.trace(), + ) + + infer_response = infer_request.exec() + + inference_response = pb_utils.InferenceResponse( + output_tensors=infer_response.output_tensors() + ) + responses.append(inference_response) + + return responses diff --git a/qa/python_models/bls_undefined/config.pbtxt b/qa/python_models/bls_undefined/config.pbtxt new file mode 100644 index 0000000000..ab873d8a64 --- /dev/null +++ b/qa/python_models/bls_undefined/config.pbtxt @@ -0,0 +1,50 @@ +# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "bls_undefined" +backend: "python" + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ -1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ -1 ] + } +] + +instance_group [{ + kind: KIND_CPU, + count: 2 +}] + diff --git a/qa/python_models/bls_undefined/model.py b/qa/python_models/bls_undefined/model.py new file mode 100644 index 0000000000..30e5f4106a --- /dev/null +++ b/qa/python_models/bls_undefined/model.py @@ -0,0 +1,33 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
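As a usage illustration, bls_simple above can be driven from a client by supplying the target model's name as a string tensor alongside the two INT32 inputs. A sketch with the Triton HTTP client on the default port; the client calls are standard tritonclient usage, while the target model name and input values are illustrative and not taken from the test suite:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

inputs = [
    httpclient.InferInput("MODEL_NAME", [1], "BYTES"),
    httpclient.InferInput("INPUT0", [1, 16], "INT32"),
    httpclient.InferInput("INPUT1", [1, 16], "INT32"),
]
# Any add/sub model with two INT32 [1, 16] inputs works; the name is illustrative.
inputs[0].set_data_from_numpy(np.array(["onnx_int32_int32_int32"], dtype=np.object_))
inputs[1].set_data_from_numpy(np.ones((1, 16), dtype=np.int32))
inputs[2].set_data_from_numpy(np.arange(16, dtype=np.int32).reshape(1, 16))

result = client.infer("bls_simple", inputs)
print(result.as_numpy("OUTPUT0"), result.as_numpy("OUTPUT1"))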
+ + +class TritonPythonModel: + def execute(self, requests): + undefined_variable + + def finalize(self): + print("Cleaning up...") diff --git a/qa/python_models/cuda_memory_consumer/1/model.py b/qa/python_models/cuda_memory_consumer/1/model.py new file mode 100644 index 0000000000..e3526920ea --- /dev/null +++ b/qa/python_models/cuda_memory_consumer/1/model.py @@ -0,0 +1,69 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
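The cuda_memory_consumer model below uses the cuda-python bindings, whose calls return a tuple whose first element is the CUresult status; the model indexes into those tuples (mem_info[0], mem_info[2]). Spelled out, that convention looks roughly like this sketch:

from cuda import cuda

(err,) = cuda.cuInit(0)
err, ctx = cuda.cuCtxCreate(0, 0)

# cuMemGetInfo() -> (status, free_bytes, total_bytes)
err, free_bytes, total_bytes = cuda.cuMemGetInfo()
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("cuMemGetInfo failed")

# Allocate roughly 40% of total device memory, as the model below does.
err, dptr = cuda.cuMemAlloc(int(total_bytes * 0.4))
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("cuMemAlloc failed")

cuda.cuMemFree(dptr)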
+ +import triton_python_backend_utils as pb_utils +from cuda import cuda + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input = {"name": "INPUT", "data_type": "TYPE_FP32", "dims": [1]} + output = {"name": "OUTPUT", "data_type": "TYPE_FP32", "dims": [1]} + + auto_complete_model_config.set_max_batch_size(0) + auto_complete_model_config.add_input(input) + auto_complete_model_config.add_output(output) + + return auto_complete_model_config + + def initialize(self, args): + self.mem_ptr = None + # Initialize CUDA context + cuda.cuInit(0) + cuda.cuCtxCreate(0, 0) + + mem_info = cuda.cuMemGetInfo() + if mem_info[0] != 0: + raise pb_utils.TritonModelException("Failed to get CUDA memory info") + + mem_alloc = cuda.cuMemAlloc(mem_info[2] * 0.4) + if mem_alloc[0] != 0: + raise pb_utils.TritonModelException("Failed to allocate CUDA memory") + self.mem_ptr = mem_alloc[1] + + def finalize(self): + if self.mem_ptr is not None: + cuda.cuMemFree(self.mem_ptr) + + def execute(self, requests): + """This function is called on inference request.""" + responses = [] + for request in requests: + input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") + out_tensor = pb_utils.Tensor("OUTPUT0", input_tensor.as_numpy()) + responses.append(pb_utils.InferenceResponse([out_tensor])) + return responses diff --git a/qa/python_models/cuda_memory_consumer/config.pbtxt b/qa/python_models/cuda_memory_consumer/config.pbtxt new file mode 100644 index 0000000000..b1e0348433 --- /dev/null +++ b/qa/python_models/cuda_memory_consumer/config.pbtxt @@ -0,0 +1,28 @@ +# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "python" +instance_group [{ kind: KIND_GPU, gpus: [0] }] diff --git a/qa/python_models/custom_metrics/config.pbtxt b/qa/python_models/custom_metrics/config.pbtxt new file mode 100644 index 0000000000..c2bf81331b --- /dev/null +++ b/qa/python_models/custom_metrics/config.pbtxt @@ -0,0 +1,43 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "custom_metrics" +backend: "python" + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +instance_group [ + { + count: 3 + kind: KIND_CPU + } +] diff --git a/qa/python_models/custom_metrics/model.py b/qa/python_models/custom_metrics/model.py new file mode 100644 index 0000000000..31f105a1dd --- /dev/null +++ b/qa/python_models/custom_metrics/model.py @@ -0,0 +1,278 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
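The custom_metrics test below exercises the Python backend metrics API end to end. As orientation, a condensed sketch of the basic flow it builds on (family creation, a labeled metric, and value updates); the metric name and labels are illustrative:

import triton_python_backend_utils as pb_utils

# A MetricFamily is created once per unique name; COUNTER and GAUGE kinds
# are supported.
family = pb_utils.MetricFamily(
    name="example_requests_total",
    description="example counter",
    kind=pb_utils.MetricFamily.COUNTER,
)

# Metrics are instantiated from the family with a label set.
metric = family.Metric(labels={"example1": "label1"})

metric.increment(1.0)  # counters only move up
print(metric.value())  # -> 1.0
# Gauges additionally support set(); on counters, set() and negative
# increments raise TritonModelException, as the test below verifies.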
+ +import unittest + +import numpy as np +import requests +import triton_python_backend_utils as pb_utils + + +class PBCustomMetricsTest(unittest.TestCase): + def _get_metrics(self): + metrics_url = "http://localhost:8002/metrics" + r = requests.get(metrics_url) + r.raise_for_status() + return r.text + + def _metric_api_helper(self, metric, kind): + # Adding logger to test if custom metrics and logging work together + # as they use the same message queue. + logger = pb_utils.Logger + + # The value should be 0.0 before the test + self.assertEqual(metric.value(), 0.0) + + # Test increment positive value + increment = 2023.0 + metric.increment(increment) + self.assertEqual(metric.value(), increment) + logger.log_info("Incremented metric to : {}".format(metric.value())) + + # Test increment negative value + decrement = -23.5 + if kind == "counter": + # Counter should not accept negative values + with self.assertRaises(pb_utils.TritonModelException): + metric.increment(decrement) + else: + metric.increment(decrement) + self.assertEqual(metric.value(), increment + decrement) + logger.log_info("Decremented metric to : {}".format(metric.value())) + + # Test set value + value = 999.9 + if kind == "counter": + # Counter does not support set + with self.assertRaises(pb_utils.TritonModelException): + metric.set(value) + else: + metric.set(value) + self.assertEqual(metric.value(), value) + logger.log_info("Set metric to : {}".format(metric.value())) + + def _dup_metric_helper(self, labels={}): + # Adding logger to test if custom metrics and logging work together + # as they use the same message queue. + logger = pb_utils.Logger + + description = "dup metric" + metric_family = pb_utils.MetricFamily( + name="test_dup_metric", + description=description, + kind=pb_utils.MetricFamily.COUNTER, + ) + + # Verify dupe metrics reference same underlying metric + metric1 = metric_family.Metric(labels=labels) + metric2 = metric_family.Metric(labels=labels) + + # The value should be 0 before the test + self.assertEqual(metric1.value(), 0.0) + self.assertEqual(metric2.value(), 0.0) + + # Increment metric 1, check metric 2 == metric 1 + increment = 7.5 + metric1.increment(increment) + self.assertEqual(metric1.value(), metric2.value()) + logger.log_info("Incremented metric1 to : {}".format(metric1.value())) + logger.log_info("Incremented metric2 to : {}".format(metric2.value())) + + # Assert custom metric/family remains when there's still a reference to it + del metric1 + metrics = self._get_metrics() + self.assertIn(description, metrics) + + def test_counter_e2e(self): + metric_family = pb_utils.MetricFamily( + name="test_counter_e2e", + description="test metric counter kind end to end", + kind=pb_utils.MetricFamily.COUNTER, + ) + labels = {"example1": "counter_label1", "example2": "counter_label2"} + metric = metric_family.Metric(labels=labels) + self._metric_api_helper(metric, "counter") + + pattern = ( + 'test_counter_e2e{example1="counter_label1",example2="counter_label2"}' + ) + metrics = self._get_metrics() + self.assertIn(pattern, metrics) + + def test_gauge_e2e(self): + metric_family = pb_utils.MetricFamily( + name="test_gauge_e2e", + description="test metric gauge kind end to end", + kind=pb_utils.MetricFamily.GAUGE, + ) + labels = {"example1": "counter_label1", "example2": "counter_label2"} + metric = metric_family.Metric(labels=labels) + self._metric_api_helper(metric, "gauge") + + pattern = 'test_gauge_e2e{example1="counter_label1",example2="counter_label2"}' + metrics = self._get_metrics() + 
self.assertIn(pattern, metrics) + + def test_dup_metric_family_diff_kind(self): + # Test that a duplicate metric family can't be added with a conflicting type/kind + metric_family1 = pb_utils.MetricFamily( + name="test_dup_metric_family_diff_kind", + description="test metric family with same name but different kind", + kind=pb_utils.MetricFamily.COUNTER, + ) + with self.assertRaises(pb_utils.TritonModelException): + metric_family2 = pb_utils.MetricFamily( + name="test_dup_metric_family_diff_kind", + description="test metric family with same name but different kind", + kind=pb_utils.MetricFamily.GAUGE, + ) + self.assertIsNone(metric_family2) + + self.assertIsNotNone(metric_family1) + + def test_dup_metric_family_diff_description(self): + # Test that a duplicate metric family name will still return the + # original metric family even if the description is changed + metric_family1 = pb_utils.MetricFamily( + name="test_dup_metric_family_diff_description", + description="first description", + kind=pb_utils.MetricFamily.COUNTER, + ) + metric_family2 = pb_utils.MetricFamily( + name="test_dup_metric_family_diff_description", + description="second description", + kind=pb_utils.MetricFamily.COUNTER, + ) + + metric2 = metric_family2.Metric() + self.assertEqual(metric2.value(), 0) + + # Delete metric_family1 and check if metric_family2 still references it + del metric_family1 + pattern = "test_dup_metric_family_diff_description first description" + metrics = self._get_metrics() + self.assertIn(pattern, metrics) + + # The first description will be kept if adding a duplicate metric + # family name with a different description + pattern = "test_dup_metric_family_diff_description second description" + self.assertNotIn(pattern, metrics) + + def test_dup_metric_family(self): + # Test that adding a duplicate metric family will reuse the original + # and not add another entry to registry + metric_family1 = pb_utils.MetricFamily( + name="test_dup_metric_family", + description="dup description", + kind=pb_utils.MetricFamily.COUNTER, + ) + metric_family2 = pb_utils.MetricFamily( + name="test_dup_metric_family", + description="dup description", + kind=pb_utils.MetricFamily.COUNTER, + ) + + metric_key = "custom_metric_key" + metric1 = metric_family1.Metric(labels={metric_key: "label1"}) + metric2 = metric_family2.Metric(labels={metric_key: "label2"}) + + self.assertEqual(metric1.value(), 0) + self.assertEqual(metric2.value(), 0) + + patterns = [ + "# HELP test_dup_metric_family dup description", + "# TYPE test_dup_metric_family counter", + 'test_dup_metric_family{custom_metric_key="label2"} 0', + 'test_dup_metric_family{custom_metric_key="label1"} 0', + ] + metrics = self._get_metrics() + for pattern in patterns: + self.assertIn(pattern, metrics) + + def test_dup_metric_labels(self): + # Test that adding a duplicate metric will refer to the same + # underlying metric, and all instances will be updated + labels = {"example1": "label1", "example2": "label2"} + self._dup_metric_helper(labels) + + def test_dup_metric_empty_labels(self): + # Test that adding a duplicate metric will refer to the same + # underlying metric, and all instances will be updated + self._dup_metric_helper() + + def test_metric_lifetime_error(self): + # Test the error handling when the corresponding 'MetricFamily' is + # deleted before the 'Metric' is deleted, and the 'Metric' is still + # being used for metric operations + kinds = [pb_utils.MetricFamily.COUNTER, pb_utils.MetricFamily.GAUGE] + metric_family_names = [ + 
"test_metric_lifetime_error_counter", + "test_metric_lifetime_error_gauge", + ] + for kind, name in zip(kinds, metric_family_names): + metric_family = pb_utils.MetricFamily( + name=name, description="test metric lifetime error", kind=kind + ) + labels = {"example1": "counter_label1", "example2": "counter_label2"} + metric = metric_family.Metric(labels=labels) + + # Intentionally delete the 'MetricFamily' before the 'Metric' being deleted + del metric_family + + error_msg = "Invalid metric operation as the corresponding 'MetricFamily' has been deleted." + + # Counter does not support set + if kind is not pb_utils.MetricFamily.COUNTER: + with self.assertRaises(pb_utils.TritonModelException) as ex: + metric.set(10) + self.assertIn(error_msg, str(ex.exception)) + + with self.assertRaises(pb_utils.TritonModelException) as ex: + metric.increment(10) + self.assertIn(error_msg, str(ex.exception)) + + with self.assertRaises(pb_utils.TritonModelException) as ex: + metric.value() + self.assertIn(error_msg, str(ex.exception)) + + +class TritonPythonModel: + def execute(self, requests): + responses = [] + for _ in requests: + # Run the unittest and store the results in InferenceResponse. + test = unittest.main("model", exit=False) + responses.append( + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) + return responses diff --git a/qa/python_models/delayed_model/model.py b/qa/python_models/delayed_model/model.py index 639497f542..e7538148f1 100644 --- a/qa/python_models/delayed_model/model.py +++ b/qa/python_models/delayed_model/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,21 +24,21 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils import time +import triton_python_backend_utils as pb_utils + # Sleep for 5 seconds to ensure that delayed startup works properly. time.sleep(5) class TritonPythonModel: - def execute(self, requests): responses = [] for request in requests: input_tensor = pb_utils.get_input_tensor_by_name(request, "IN") - out_tensor = utils.Tensor("OUT", input_tensor.as_numpy()) - responses.append(utils.InferenceResponse([out_tensor])) + out_tensor = pb_utils.Tensor("OUT", input_tensor.as_numpy()) + responses.append(pb_utils.InferenceResponse([out_tensor])) return responses def finalize(self): diff --git a/qa/python_models/dlpack_add_sub/model.py b/qa/python_models/dlpack_add_sub/model.py index e32e31c9a8..7f70e05d5c 100644 --- a/qa/python_models/dlpack_add_sub/model.py +++ b/qa/python_models/dlpack_add_sub/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,27 +24,27 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import triton_python_backend_utils as pb_utils -from torch.utils.dlpack import to_dlpack, from_dlpack -import torch -import numpy as np import json +import numpy as np +import torch +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack, to_dlpack + class TritonPythonModel: - def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) self.numpy_to_pytorch_dtype = { np.bool_: torch.bool, np.uint8: torch.uint8, @@ -68,52 +68,63 @@ def execute(self, requests): # If both of the tensors are in CPU, use NumPy. if in_0.is_cpu() and in_1.is_cpu(): - if in_0.as_numpy().dtype.type is np.bytes_ or in_0.as_numpy( - ).dtype == np.object_: - out_0, out_1 = (in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32),\ - in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32)) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) + + in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) + - in_1.as_numpy().astype(np.int32), + ) + out_tensor_0 = pb_utils.Tensor( + "OUTPUT0", out_0.astype(output0_dtype) + ) + out_tensor_1 = pb_utils.Tensor( + "OUTPUT1", out_1.astype(output1_dtype) + ) else: in_0_pytorch, in_1_pytorch = from_dlpack( - in_0.to_dlpack()), from_dlpack(in_1.to_dlpack()) - out_0, out_1 = (in_0_pytorch + in_1_pytorch, - in_0_pytorch - in_1_pytorch) + in_0.to_dlpack() + ), from_dlpack(in_1.to_dlpack()) + out_0, out_1 = ( + in_0_pytorch + in_1_pytorch, + in_0_pytorch - in_1_pytorch, + ) if self.output0_dtype == np.object_: out_tensor_0 = pb_utils.Tensor( - "OUTPUT0", - out_0.numpy().astype(output0_dtype)) + "OUTPUT0", out_0.numpy().astype(output0_dtype) + ) else: - out_0 = out_0.type( - self.numpy_to_pytorch_dtype[output0_dtype]) + out_0 = out_0.type(self.numpy_to_pytorch_dtype[output0_dtype]) out_tensor_0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(out_0)) + "OUTPUT0", to_dlpack(out_0) + ) if self.output1_dtype == np.object_: out_tensor_1 = pb_utils.Tensor( - "OUTPUT1", - out_1.numpy().astype(output1_dtype)) + "OUTPUT1", out_1.numpy().astype(output1_dtype) + ) else: - out_1 = out_1.type( - self.numpy_to_pytorch_dtype[output1_dtype]) + out_1 = out_1.type(self.numpy_to_pytorch_dtype[output1_dtype]) out_tensor_1 = pb_utils.Tensor.from_dlpack( - "OUTPUT1", to_dlpack(out_1)) + "OUTPUT1", to_dlpack(out_1) + ) else: - in_0_pytorch, in_1_pytorch = from_dlpack( - in_0.to_dlpack()).cuda(), from_dlpack( - in_1.to_dlpack()).cuda() - out_0, out_1 = (in_0_pytorch + in_1_pytorch, - in_0_pytorch - in_1_pytorch) - out_tensor_0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(out_0)) - out_tensor_1 = pb_utils.Tensor.from_dlpack( - 
"OUTPUT1", to_dlpack(out_1)) + in_0_pytorch, in_1_pytorch = ( + from_dlpack(in_0.to_dlpack()).cuda(), + from_dlpack(in_1.to_dlpack()).cuda(), + ) + out_0, out_1 = ( + in_0_pytorch + in_1_pytorch, + in_0_pytorch - in_1_pytorch, + ) + out_tensor_0 = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(out_0)) + out_tensor_1 = pb_utils.Tensor.from_dlpack("OUTPUT1", to_dlpack(out_1)) - responses.append( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + responses.append(pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) return responses diff --git a/qa/python_models/dlpack_empty_output/config.pbtxt b/qa/python_models/dlpack_empty_output/config.pbtxt new file mode 100644 index 0000000000..d026db1cd1 --- /dev/null +++ b/qa/python_models/dlpack_empty_output/config.pbtxt @@ -0,0 +1,43 @@ +# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "dlpack_empty_output" +max_batch_size: 8 + +input [ + { + name: "INPUT" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] +output [ + { + name: "OUTPUT" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] diff --git a/qa/python_models/dlpack_empty_output/model.py b/qa/python_models/dlpack_empty_output/model.py new file mode 100644 index 0000000000..7784e28b4d --- /dev/null +++ b/qa/python_models/dlpack_empty_output/model.py @@ -0,0 +1,53 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import torch +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import to_dlpack + + +class TritonPythonModel: + def initialize(self, args): + pass + + def execute(self, requests): + responses = [] + + for _ in requests: + SHAPE = (0,) + + pytorch_tensor = torch.ones(SHAPE, dtype=torch.float32) + + device = torch.device("cuda:0") + pytorch_tensor = pytorch_tensor.to(device) + + dlpack_tensor = to_dlpack(pytorch_tensor) + pb_tensor = pb_utils.Tensor.from_dlpack("OUTPUT", dlpack_tensor) + + inference_response = pb_utils.InferenceResponse(output_tensors=[pb_tensor]) + responses.append(inference_response) + + return responses diff --git a/qa/python_models/dlpack_identity/model.py b/qa/python_models/dlpack_identity/model.py index 9057180381..1bd0748df9 100644 --- a/qa/python_models/dlpack_identity/model.py +++ b/qa/python_models/dlpack_identity/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,7 +28,6 @@ class TritonPythonModel: - def execute(self, requests): """Identity model in Python backend that works with GPU and CPU tensors.""" @@ -36,7 +35,8 @@ def execute(self, requests): responses = [] for request in requests: input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") - out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", input_tensor.to_dlpack()) + out_tensor = pb_utils.Tensor.from_dlpack( + "OUTPUT0", input_tensor.to_dlpack() + ) responses.append(pb_utils.InferenceResponse([out_tensor])) return responses - diff --git a/qa/python_models/dlpack_io_identity/model.py b/qa/python_models/dlpack_io_identity/model.py index f98a4f51c4..225d026992 100644 --- a/qa/python_models/dlpack_io_identity/model.py +++ b/qa/python_models/dlpack_io_identity/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,9 +24,9 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
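The dlpack_empty_output model above verifies that a zero-element GPU tensor can be handed back to Triton through DLPack. The round trip it relies on can be sketched outside the model as follows; this is illustrative only, pb_utils exists only inside a running Python backend model, and the GPU move is kept optional here:

import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import triton_python_backend_utils as pb_utils

empty = torch.ones((0,), dtype=torch.float32)
if torch.cuda.is_available():
    empty = empty.to("cuda:0")  # the QA model always places the tensor on GPU 0
pb_tensor = pb_utils.Tensor.from_dlpack("OUTPUT", to_dlpack(empty))
# A zero-element tensor survives the handoff, so no special-casing of the shape is needed.
assert tuple(from_dlpack(pb_tensor.to_dlpack()).shape) == (0,)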
-import triton_python_backend_utils as pb_utils -from torch.utils.dlpack import to_dlpack, from_dlpack import numpy as np +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack, to_dlpack class TritonPythonModel: @@ -36,70 +36,73 @@ class TritonPythonModel: """ def initialize(self, args): - self._model_name = args['model_name'] + self._model_name = args["model_name"] def execute(self, requests): responses = [] for request in requests: input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") gpu_output = pb_utils.get_input_tensor_by_name( - request, "GPU_OUTPUT").as_numpy() + request, "GPU_OUTPUT" + ).as_numpy() if input0.is_cpu(): if not gpu_output[0]: - output0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", input0.to_dlpack()) + output0 = pb_utils.Tensor.from_dlpack("OUTPUT0", input0.to_dlpack()) else: outptu0_pytorch = from_dlpack(input0.to_dlpack()).cuda() output0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(outptu0_pytorch)) + "OUTPUT0", to_dlpack(outptu0_pytorch) + ) else: if gpu_output[0]: - output0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", input0.to_dlpack()) + output0 = pb_utils.Tensor.from_dlpack("OUTPUT0", input0.to_dlpack()) else: outptu0_pytorch = from_dlpack(input0.to_dlpack()).cpu() output0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(outptu0_pytorch)) + "OUTPUT0", to_dlpack(outptu0_pytorch) + ) next_gpu_output = pb_utils.Tensor("NEXT_GPU_OUTPUT", gpu_output[1:]) # Do not perform BLS inference if it is the first # model in the pipeline. - if self._model_name != 'dlpack_io_identity_1': + if self._model_name != "dlpack_io_identity_1": infer_request = pb_utils.InferenceRequest( - model_name='dlpack_io_identity_1', + model_name="dlpack_io_identity_1", inputs=[ input0, - pb_utils.get_input_tensor_by_name( - request, "GPU_OUTPUT") + pb_utils.get_input_tensor_by_name(request, "GPU_OUTPUT"), ], - requested_output_names=['OUTPUT0']) + requested_output_names=["OUTPUT0"], + ) infer_response = infer_request.exec() if infer_response.has_error(): raise pb_utils.TritonModelException( - infer_response.error().message()) + infer_response.error().message() + ) bls_output0 = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT0') + infer_response, "OUTPUT0" + ) if not output0.is_cpu(): - bls_output0 = from_dlpack( - bls_output0.to_dlpack()).detach().cpu().numpy() + bls_output0 = ( + from_dlpack(bls_output0.to_dlpack()).detach().cpu().numpy() + ) else: bls_output0 = bls_output0.as_numpy() if not input0.is_cpu(): - input0 = from_dlpack( - input0.to_dlpack()).detach().cpu().numpy() + input0 = from_dlpack(input0.to_dlpack()).detach().cpu().numpy() else: input0 = input0.as_numpy() if not np.allclose(bls_output0, input0): raise pb_utils.TritonModelException( - 'BLS input and output tensors are not equal') + "BLS input and output tensors are not equal" + ) - responses.append( - pb_utils.InferenceResponse([output0, next_gpu_output])) + responses.append(pb_utils.InferenceResponse([output0, next_gpu_output])) return responses diff --git a/qa/python_models/dlpack_io_identity_decoupled/model.py b/qa/python_models/dlpack_io_identity_decoupled/model.py index 5b4de86e60..5f4e597df8 100644 --- a/qa/python_models/dlpack_io_identity_decoupled/model.py +++ b/qa/python_models/dlpack_io_identity_decoupled/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
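Besides the DLPack handling, the dlpack_io_identity model above also issues a nested BLS (Business Logic Scripting) request and validates the result. Stripped to its essentials, the call pattern is the one sketched below; the downstream model name is a placeholder rather than one of the QA models:

import triton_python_backend_utils as pb_utils


def bls_call(input_tensor):
    # Build and execute a nested request against another model on the same server.
    infer_request = pb_utils.InferenceRequest(
        model_name="some_downstream_model",  # placeholder
        inputs=[input_tensor],
        requested_output_names=["OUTPUT0"],
    )
    infer_response = infer_request.exec()
    if infer_response.has_error():
        raise pb_utils.TritonModelException(infer_response.error().message())
    return pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0")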
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,11 +24,11 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils -from torch.utils.dlpack import to_dlpack, from_dlpack -import numpy as np -import time import threading +import time + +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack, to_dlpack class TritonPythonModel: @@ -38,7 +38,7 @@ class TritonPythonModel: """ def initialize(self, args): - self._model_name = args['model_name'] + self._model_name = args["model_name"] self.inflight_thread_count = 0 self.inflight_thread_count_lck = threading.Lock() @@ -48,20 +48,20 @@ def response_thread(self, response_sender, input0, gpu_output): if input0.is_cpu(): if not gpu_output[0]: - output0 = pb_utils.Tensor.from_dlpack("OUTPUT0", - input0.to_dlpack()) + output0 = pb_utils.Tensor.from_dlpack("OUTPUT0", input0.to_dlpack()) else: outptu0_pytorch = from_dlpack(input0.to_dlpack()).cuda() output0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(outptu0_pytorch)) + "OUTPUT0", to_dlpack(outptu0_pytorch) + ) else: if gpu_output[0]: - output0 = pb_utils.Tensor.from_dlpack("OUTPUT0", - input0.to_dlpack()) + output0 = pb_utils.Tensor.from_dlpack("OUTPUT0", input0.to_dlpack()) else: output0_pytorch = from_dlpack(input0.to_dlpack()).cpu() output0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(output0_pytorch)) + "OUTPUT0", to_dlpack(output0_pytorch) + ) next_gpu_output = pb_utils.Tensor("NEXT_GPU_OUTPUT", gpu_output[1:]) infer_response = pb_utils.InferenceResponse([output0, next_gpu_output]) @@ -71,8 +71,7 @@ def response_thread(self, response_sender, input0, gpu_output): for _ in range(response_repeat): response_sender.send(infer_response) - response_sender.send( - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) with self.inflight_thread_count_lck: self.inflight_thread_count -= 1 @@ -81,11 +80,13 @@ def execute(self, requests): for request in requests: input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") gpu_output = pb_utils.get_input_tensor_by_name( - request, "GPU_OUTPUT").as_numpy() + request, "GPU_OUTPUT" + ).as_numpy() - thread = threading.Thread(target=self.response_thread, - args=(request.get_response_sender(), - input0, gpu_output)) + thread = threading.Thread( + target=self.response_thread, + args=(request.get_response_sender(), input0, gpu_output), + ) thread.daemon = True @@ -99,11 +100,11 @@ def finalize(self): cycles = 0 logging_time_sec = 5 sleep_time_sec = 0.1 - cycle_to_log = (logging_time_sec / sleep_time_sec) + cycle_to_log = logging_time_sec / sleep_time_sec while inflight_threads: with self.inflight_thread_count_lck: - inflight_threads = (self.inflight_thread_count != 0) - if (cycles % cycle_to_log == 0): + inflight_threads = self.inflight_thread_count != 0 + if cycles % cycle_to_log == 0: print( f"Waiting for {self.inflight_thread_count} response threads to complete..." ) diff --git a/qa/python_models/dlpack_square/config.pbtxt b/qa/python_models/dlpack_square/config.pbtxt new file mode 100644 index 0000000000..15cf6b7fd2 --- /dev/null +++ b/qa/python_models/dlpack_square/config.pbtxt @@ -0,0 +1,48 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "dlpack_square" +backend: "python" +max_batch_size: 0 +model_transaction_policy { + decoupled: True +} +input [ + { + name: "IN" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] +output [ + { + name: "OUT" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] +instance_group [{ kind: KIND_CPU }] + diff --git a/qa/python_models/dlpack_square/model.py b/qa/python_models/dlpack_square/model.py new file mode 100644 index 0000000000..b31531461e --- /dev/null +++ b/qa/python_models/dlpack_square/model.py @@ -0,0 +1,139 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
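The dlpack_square model whose implementation follows is decoupled: model_transaction_policy { decoupled: True } in the config above allows it to return any number of responses per request through a response sender instead of returning them from execute(). Before the threaded, DLPack-aware version below, here is the bare decoupled pattern as a sketch that uses only calls appearing elsewhere in this patch:

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            in_numpy = pb_utils.get_input_tensor_by_name(request, "IN").as_numpy()
            # Send IN back 'IN[0]' times, mirroring the variable response count of the model below.
            for _ in range(int(in_numpy[0])):
                out_tensor = pb_utils.Tensor("OUT", in_numpy)
                sender.send(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
            # Closing the sender tells Triton no more responses are coming.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None; all responses flow through the senders.
        return None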
+ +import json +import threading + +import numpy as np +import torch + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack, to_dlpack + +numpy_to_pytorch_dtype = { + np.bool_: torch.bool, + np.uint8: torch.uint8, + np.int8: torch.int8, + np.int16: torch.int16, + np.int32: torch.int32, + np.int64: torch.int64, + np.float16: torch.float16, + np.float32: torch.float32, + np.float64: torch.float64, +} + + +class TritonPythonModel: + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + output_config = pb_utils.get_output_config_by_name(model_config, "OUT") + self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + + using_decoupled = pb_utils.using_decoupled_model_transaction_policy( + model_config + ) + if not using_decoupled: + raise pb_utils.TritonModelException( + """the model `{}` can generate any number of responses per request, + enable decoupled transaction policy in model configuration to + serve this model""".format( + args["model_name"] + ) + ) + + self.inflight_thread_count = 0 + self.inflight_thread_count_lck = threading.Lock() + + def execute(self, requests): + for request in requests: + self.process_request(request) + + return None + + def process_request(self, request): + # Start a separate thread to send the responses for the request. The + # sending back the responses is delegated to this thread. + thread = threading.Thread( + target=self.response_thread, + args=( + request.get_response_sender(), + pb_utils.get_input_tensor_by_name(request, "IN"), + self.output_dtype, + ), + ) + + thread.daemon = True + + with self.inflight_thread_count_lck: + self.inflight_thread_count += 1 + + thread.start() + + def response_thread(self, response_sender, in_input, output_dtype): + # The response_sender is used to send response(s) associated with the + # corresponding request. + + for idx in range(in_input.as_numpy()[0]): + if in_input.is_cpu(): + if ( + in_input.as_numpy().dtype.type is np.bytes_ + or in_input.as_numpy().dtype == np.object_ + ): + out_0 = in_input.as_numpy().astype(np.int32) + out_tensor = pb_utils.Tensor("OUT", out_0.astype(output_dtype)) + else: + in_0_pytorch = from_dlpack(in_input.to_dlpack()) + out_0 = in_0_pytorch + if output_dtype == np.object_: + out_tensor = pb_utils.Tensor( + "OUT", out_0.numpy().astype(output_dtype) + ) + else: + out_0 = out_0.type(numpy_to_pytorch_dtype[output_dtype]) + out_tensor = pb_utils.Tensor.from_dlpack( + "OUT", to_dlpack(out_0) + ) + else: + in_0_pytorch = from_dlpack(in_input.to_dlpack()).cuda() + out_0 = in_0_pytorch + out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(out_0)) + + response = pb_utils.InferenceResponse(output_tensors=[out_tensor]) + response_sender.send(response) + + # We must close the response sender to indicate to Triton that we are + # done sending responses for the corresponding request. We can't use the + # response sender after closing it. The response sender is closed by + # setting the TRITONSERVER_RESPONSE_COMPLETE_FINAL. 
+ response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + + with self.inflight_thread_count_lck: + self.inflight_thread_count -= 1 diff --git a/qa/python_models/dlpack_sub_add/model.py b/qa/python_models/dlpack_sub_add/model.py index af07874a9f..16caafcea2 100644 --- a/qa/python_models/dlpack_sub_add/model.py +++ b/qa/python_models/dlpack_sub_add/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,27 +24,27 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils -from torch.utils.dlpack import to_dlpack, from_dlpack -import torch -import numpy as np import json +import numpy as np +import torch +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack, to_dlpack + class TritonPythonModel: - def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) self.numpy_to_pytorch_dtype = { np.bool_: torch.bool, np.uint8: torch.uint8, @@ -68,52 +68,63 @@ def execute(self, requests): # If both of the tensors are in CPU, use NumPy. 
if in_0.is_cpu() and in_1.is_cpu(): - if in_0.as_numpy().dtype.type is np.bytes_ or in_0.as_numpy( - ).dtype == np.object_: - out_0, out_1 = (in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32),\ - in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32)) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) + - in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) + + in_1.as_numpy().astype(np.int32), + ) + out_tensor_0 = pb_utils.Tensor( + "OUTPUT0", out_0.astype(output0_dtype) + ) + out_tensor_1 = pb_utils.Tensor( + "OUTPUT1", out_1.astype(output1_dtype) + ) else: in_0_pytorch, in_1_pytorch = from_dlpack( - in_0.to_dlpack()), from_dlpack(in_1.to_dlpack()) - out_0, out_1 = (in_0_pytorch - in_1_pytorch, - in_0_pytorch + in_1_pytorch) + in_0.to_dlpack() + ), from_dlpack(in_1.to_dlpack()) + out_0, out_1 = ( + in_0_pytorch - in_1_pytorch, + in_0_pytorch + in_1_pytorch, + ) if self.output0_dtype == np.object_: out_tensor_0 = pb_utils.Tensor( - "OUTPUT0", - out_0.numpy().astype(output0_dtype)) + "OUTPUT0", out_0.numpy().astype(output0_dtype) + ) else: - out_0 = out_0.type( - self.numpy_to_pytorch_dtype[output0_dtype]) + out_0 = out_0.type(self.numpy_to_pytorch_dtype[output0_dtype]) out_tensor_0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(out_0)) + "OUTPUT0", to_dlpack(out_0) + ) if self.output1_dtype == np.object_: out_tensor_1 = pb_utils.Tensor( - "OUTPUT1", - out_1.numpy().astype(output1_dtype)) + "OUTPUT1", out_1.numpy().astype(output1_dtype) + ) else: - out_1 = out_1.type( - self.numpy_to_pytorch_dtype[output1_dtype]) + out_1 = out_1.type(self.numpy_to_pytorch_dtype[output1_dtype]) out_tensor_1 = pb_utils.Tensor.from_dlpack( - "OUTPUT1", to_dlpack(out_1)) + "OUTPUT1", to_dlpack(out_1) + ) else: - in_0_pytorch, in_1_pytorch = from_dlpack( - in_0.to_dlpack()).cuda(), from_dlpack( - in_1.to_dlpack()).cuda() - out_0, out_1 = (in_0_pytorch - in_1_pytorch, - in_0_pytorch + in_1_pytorch) - out_tensor_0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(out_0)) - out_tensor_1 = pb_utils.Tensor.from_dlpack( - "OUTPUT1", to_dlpack(out_1)) + in_0_pytorch, in_1_pytorch = ( + from_dlpack(in_0.to_dlpack()).cuda(), + from_dlpack(in_1.to_dlpack()).cuda(), + ) + out_0, out_1 = ( + in_0_pytorch - in_1_pytorch, + in_0_pytorch + in_1_pytorch, + ) + out_tensor_0 = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(out_0)) + out_tensor_1 = pb_utils.Tensor.from_dlpack("OUTPUT1", to_dlpack(out_1)) - responses.append( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + responses.append(pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) return responses diff --git a/qa/python_models/dlpack_test/model.py b/qa/python_models/dlpack_test/model.py index cd3ab37c7d..64bc7d6692 100644 --- a/qa/python_models/dlpack_test/model.py +++ b/qa/python_models/dlpack_test/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
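Both dlpack_add_sub and dlpack_sub_add (reformatted above) follow the same dispatch rule: if both inputs live on the CPU, compute with NumPy; otherwise hand the buffers to PyTorch through DLPack and compute on the GPU without copying through host memory. A condensed sketch of that skeleton, with the bytes/object dtype branch omitted for brevity:

import triton_python_backend_utils as pb_utils
from torch.utils.dlpack import from_dlpack, to_dlpack


def add_and_sub(in_0, in_1):
    if in_0.is_cpu() and in_1.is_cpu():
        a, b = in_0.as_numpy(), in_1.as_numpy()
        return (
            pb_utils.Tensor("OUTPUT0", a + b),
            pb_utils.Tensor("OUTPUT1", a - b),
        )
    # At least one input is on the GPU: wrap the buffers as torch tensors via DLPack.
    a = from_dlpack(in_0.to_dlpack()).cuda()
    b = from_dlpack(in_1.to_dlpack()).cuda()
    return (
        pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(a + b)),
        pb_utils.Tensor.from_dlpack("OUTPUT1", to_dlpack(a - b)),
    )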
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,44 +24,61 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import unittest + +import cupy as cp +import numpy as np import torch -from torch.utils.dlpack import from_dlpack, to_dlpack import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack, to_dlpack class PBTensorTest(unittest.TestCase): - def test_pytorch_dlpack(self): # Test different dtypes pytorch_dtypes = [ - torch.float16, torch.float32, torch.float64, torch.int8, - torch.int16, torch.int32, torch.int64, torch.uint8, torch.bool + torch.float16, + torch.float32, + torch.float64, + torch.int8, + torch.int16, + torch.int32, + torch.int64, + torch.uint8, ] for pytorch_dtype in pytorch_dtypes: - pytorch_tensor = torch.rand([100], dtype=torch.float16) * 100 - pytorch_tensor = pytorch_tensor.type(pytorch_dtype) + pytorch_tensor = torch.ones([100], dtype=pytorch_dtype) dlpack_tensor = to_dlpack(pytorch_tensor) - pb_tensor = pb_utils.Tensor.from_dlpack('test_tensor', - dlpack_tensor) + pb_tensor = pb_utils.Tensor.from_dlpack("test_tensor", dlpack_tensor) self.assertTrue( - np.all(pb_tensor.as_numpy() == pytorch_tensor.numpy())) + np.array_equal(pb_tensor.as_numpy(), pytorch_tensor.numpy()) + ) # Convert the tensor back to DLPack and ensure that both tensors are # the same pytorch_tensor_dlpack = from_dlpack(pb_tensor.to_dlpack()) - self.assertTrue(torch.all(pytorch_tensor_dlpack == pytorch_tensor)) + self.assertTrue(torch.equal(pytorch_tensor_dlpack, pytorch_tensor)) + + self.assertEqual(pytorch_tensor.type(), pytorch_tensor_dlpack.type()) + + # Now let's check that upgraded DLPack implementation also + # works as expected, i.e. from_dlpack should work with + # external pytorch tensor directly - # DLPack does not properly support bool type: - # https://github.com/google/jax/issues/4719 - if pytorch_dtype != torch.bool: - self.assertTrue( - pytorch_tensor.type() == pytorch_tensor_dlpack.type()) - else: - self.assertFalse( - pytorch_tensor.type() == pytorch_tensor_dlpack.type()) + pb_tensor_upgraded = pb_utils.Tensor.from_dlpack( + "test_tensor", pytorch_tensor + ) + self.assertTrue( + np.array_equal(pb_tensor_upgraded.as_numpy(), pytorch_tensor.numpy()) + ) + + # Here we check that `pb_tensor` as a producer, properly + # invokes `__dlpack__` and `__dlpack_device__` + pytorch_tensor_dlpack = from_dlpack(pb_tensor_upgraded) + self.assertTrue(torch.equal(pytorch_tensor_dlpack, pytorch_tensor)) + + self.assertEqual(pytorch_tensor.type(), pytorch_tensor_dlpack.type()) def test_non_contiguous_error(self): pytorch_tensor = torch.rand([20, 30], dtype=torch.float16) @@ -70,78 +87,257 @@ def test_non_contiguous_error(self): pytorch_tensor = torch.transpose(pytorch_tensor, 0, 1) with self.assertRaises(Exception) as e: - pb_utils.Tensor.from_dlpack('test_tensor', - to_dlpack(pytorch_tensor)) + pb_utils.Tensor.from_dlpack("test_tensor", to_dlpack(pytorch_tensor)) self.assertTrue( - str(e.exception) == - 'DLPack tensor is not contiguous. Only contiguous DLPack tensors that are stored in C-Order are supported.' + str(e.exception) + == "DLPack tensor is not contiguous. Only contiguous DLPack tensors that are stored in C-Order are supported." 
) def test_dlpack_string_tensor(self): - np_object = np.array(['An Example String'], dtype=np.object_) - pb_tensor = pb_utils.Tensor('test_tensor', np_object) + np_object = np.array(["An Example String"], dtype=np.object_) + pb_tensor = pb_utils.Tensor("test_tensor", np_object) with self.assertRaises(Exception) as e: pb_tensor.to_dlpack() self.assertTrue( - str(e.exception) == - 'DLPack does not have support for string tensors.') + str(e.exception) == "DLPack does not have support for string tensors." + ) def test_dlpack_gpu_tensors(self): # Test different dtypes + # PyTorch does not support DLPack bool type yet: + # https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/DLConvertor.cpp pytorch_dtypes = [ - torch.float16, torch.float32, torch.float64, torch.int8, - torch.int16, torch.int32, torch.int64, torch.uint8, torch.bool + torch.float16, + torch.float32, + torch.float64, + torch.int8, + torch.int16, + torch.int32, + torch.int64, + torch.uint8, ] for pytorch_dtype in pytorch_dtypes: - pytorch_tensor = torch.rand( - [100], dtype=torch.float16, device='cuda') * 100 - pytorch_tensor = pytorch_tensor.type(pytorch_dtype) + pytorch_tensor = torch.ones([100], dtype=pytorch_dtype, device="cuda") dlpack_tensor = to_dlpack(pytorch_tensor) - pb_tensor = pb_utils.Tensor.from_dlpack('test_tensor', - dlpack_tensor) + pb_tensor = pb_utils.Tensor.from_dlpack("test_tensor", dlpack_tensor) # Convert the tensor back to DLPack and ensure that both tensors are # the same pytorch_tensor_dlpack = from_dlpack(pb_tensor.to_dlpack()) - self.assertTrue(torch.all(pytorch_tensor_dlpack == pytorch_tensor)) + self.assertTrue(torch.equal(pytorch_tensor_dlpack, pytorch_tensor)) + self.assertEqual(pytorch_tensor.type(), pytorch_tensor_dlpack.type()) - # DLPack does not properly support bool type: - # https://github.com/google/jax/issues/4719 - if pytorch_dtype != torch.bool: - self.assertTrue( - pytorch_tensor.type() == pytorch_tensor_dlpack.type()) - else: - self.assertFalse( - pytorch_tensor.type() == pytorch_tensor_dlpack.type()) + # Now we make sure that updated DLPack implementation works + # with GPU as well + pb_tensor = pb_utils.Tensor.from_dlpack("test_tensor", pytorch_tensor) + pytorch_tensor_dlpack = from_dlpack(pb_tensor) + self.assertTrue(torch.equal(pytorch_tensor_dlpack, pytorch_tensor)) + self.assertEqual(pytorch_tensor.type(), pytorch_tensor_dlpack.type()) def test_dlpack_gpu_numpy(self): # DLPack tesnors that are in GPU cannot be converted to NumPy - pytorch_tensor = torch.rand([100], dtype=torch.float16, - device='cuda') * 100 - pb_tensor = pb_utils.Tensor.from_dlpack('tensor', - to_dlpack(pytorch_tensor)) + pytorch_tensor = torch.rand([100], dtype=torch.float16, device="cuda") * 100 + pb_tensor = pb_utils.Tensor.from_dlpack("tensor", to_dlpack(pytorch_tensor)) + # Make sure that `__dlpack_device__` works as expected + self.assertFalse(pb_tensor.is_cpu()) + self.assertTrue(pytorch_tensor.is_cuda) + self.assertEqual( + pb_tensor.__dlpack_device__(), pytorch_tensor.__dlpack_device__() + ) + with self.assertRaises(Exception) as e: pb_tensor.as_numpy() self.assertTrue( - str(e.exception) == - 'Tensor is stored in GPU and cannot be converted to NumPy.') + str(e.exception) + == "Tensor is stored in GPU and cannot be converted to NumPy." 
+ ) + + def test_dlpack_cpu_numpy(self): + # Check compatibiity of PbTensor DLPack implementation + # with numpy + pytorch_tensor = torch.rand([100], dtype=torch.float16, device="cpu") * 100 + pb_tensor = pb_utils.Tensor.from_dlpack("tensor", pytorch_tensor) + numpy_tensor_dlpack = np.from_dlpack(pb_tensor) + self.assertTrue(np.array_equal(numpy_tensor_dlpack, pytorch_tensor.numpy())) + # Make sure that `__dlpack_device__` works as expected + self.assertTrue(pb_tensor.is_cpu()) + self.assertFalse(pytorch_tensor.is_cuda) + self.assertEqual( + pb_tensor.__dlpack_device__(), pytorch_tensor.__dlpack_device__() + ) + def test_bool_datatype(self): + # [FIXME] pass bool_array directly to `pb_utils.Tensor.from_dlpack`, + # when numpy release supports DLPack bool type + bool_array = np.asarray([False, True]) + bool_tensor = pb_utils.Tensor("tensor", bool_array) + bool_tensor_dlpack = pb_utils.Tensor.from_dlpack("tensor", bool_tensor) + self.assertTrue(np.array_equal(bool_array, bool_tensor_dlpack.as_numpy())) -class TritonPythonModel: + def test_cuda_multi_stream(self): + # Test that external stream syncs with the default + # and pb_tensor has proper data + size = 5000 + pytorch_tensor_1 = torch.tensor([0, 0, 0, 0], device="cuda") + pytorch_tensor_2 = torch.tensor([0, 0, 0, 0], device="cuda") + expected_output = torch.tensor([2, 2, 2, 2], device="cuda") + s1 = torch.cuda.Stream() + with torch.cuda.stream(s1): + matrix_a = torch.randn(size, size, device="cuda") + res = torch.matmul(matrix_a, matrix_a) + for _ in range(1000): + res = torch.matmul(res, matrix_a) + pytorch_tensor_1 += torch.tensor([2, 2, 2, 2], device="cuda") + pytorch_tensor_2 += torch.tensor([2, 2, 2, 2], device="cuda") + + pb_tensor_1 = pb_utils.Tensor.from_dlpack("tensor", pytorch_tensor_1) + pb_tensor_2 = pb_utils.Tensor.from_dlpack("tensor", to_dlpack(pytorch_tensor_2)) + pytorch_tensor_dlpack = from_dlpack(pb_tensor_1) + self.assertTrue(torch.equal(pytorch_tensor_dlpack, expected_output)) + pytorch_tensor_dlpack = from_dlpack(pb_tensor_2) + self.assertTrue(torch.equal(pytorch_tensor_dlpack, expected_output)) + + def test_cuda_non_blocking_multi_stream(self): + # Test that external non-blocking stream syncs with the default stream + # and pb_tensor has proper data + size = 5000 + cupy_tensor = cp.array([0, 0, 0, 0]) + expected_output = cp.array([2, 2, 2, 2]) + non_blocking_stream = cp.cuda.Stream(non_blocking=True) + with non_blocking_stream: + matrix_a = cp.random.rand(size, size) + res = cp.matmul(matrix_a, matrix_a) + for _ in range(1000): + res = cp.matmul(res, matrix_a) + cupy_tensor += cp.array([2, 2, 2, 2]) + + pb_tensor = pb_utils.Tensor.from_dlpack("tensor", cupy_tensor) + # Verify that non-blocking stream has no pending jobs left + self.assertTrue(non_blocking_stream.done) + cupy_tensor_dlpack = cp.from_dlpack(pb_tensor) + self.assertTrue(cp.array_equal(cupy_tensor_dlpack, expected_output)) + self.assertFalse(pb_tensor.is_cpu()) + self.assertEqual(pb_tensor.__dlpack_device__(), cupy_tensor.__dlpack_device__()) + + def test_cuda_multi_gpu(self): + # Test that when `pb_utils.Tensor.from_dlpack` is called on different + # GPU from where external tensor is stored, we receive a pointer + # and all pending work on different GPU's default stream + # on external tensor is done + size = 5000 + # DLDeviceType::kDLCUDA, device_id 1 + expected_dlpack_device = (2, 1) + with cp.cuda.Device(1): + expected_output = cp.array([2, 2, 2, 2]) + cupy_tensor = cp.array([0, 0, 0, 0]) + matrix_a = cp.random.rand(size, size) + res = 
cp.matmul(matrix_a, matrix_a) + for _ in range(1000): + res = cp.matmul(res, matrix_a) + cupy_tensor += cp.array([2, 2, 2, 2]) + with cp.cuda.Device(0): + pb_tensor = pb_utils.Tensor.from_dlpack("tensor", cupy_tensor) + with cp.cuda.Device(1): + # To make sure that the default stream is done with + # all compute work + self.assertTrue(cp.cuda.Stream(null=True).done) + cupy_tensor_dlpack = cp.from_dlpack(pb_tensor) + + with cp.cuda.Device(1): + self.assertTrue(cp.array_equal(cupy_tensor_dlpack, expected_output)) + + self.assertFalse(pb_tensor.is_cpu()) + self.assertEqual(pb_tensor.__dlpack_device__(), expected_dlpack_device) + self.assertEqual(pb_tensor.__dlpack_device__(), cupy_tensor.__dlpack_device__()) + def test_cuda_blocking_stream_multi_gpu(self): + # Test that when `pb_utils.Tensor.from_dlpack` is called on different + # GPU from where external tensor is stored, we receive a pointer + # and all pending work on different GPU's a blocking stream + # on external tensor is done + size = 5000 + # DLDeviceType::kDLCUDA, device_id 1 + expected_dlpack_device = (2, 1) + with cp.cuda.Device(1): + expected_output = cp.array([2, 2, 2, 2]) + blocking_stream = cp.cuda.Stream(non_blocking=False) + with blocking_stream: + cupy_tensor = cp.array([0, 0, 0, 0]) + matrix_a = cp.random.rand(size, size) + res = cp.matmul(matrix_a, matrix_a) + for _ in range(1000): + res = cp.matmul(res, matrix_a) + cupy_tensor += cp.array([2, 2, 2, 2]) + with cp.cuda.Device(0): + pb_tensor = pb_utils.Tensor.from_dlpack("tensor", cupy_tensor) + with cp.cuda.Device(1): + # To make sure that blocking stream is done with + # all compute work + self.assertTrue(blocking_stream.done) + cupy_tensor_dlpack = cp.from_dlpack(pb_tensor) + + with cp.cuda.Device(1): + self.assertTrue(cp.array_equal(cupy_tensor_dlpack, expected_output)) + + self.assertFalse(pb_tensor.is_cpu()) + self.assertEqual(pb_tensor.__dlpack_device__(), expected_dlpack_device) + self.assertEqual(pb_tensor.__dlpack_device__(), cupy_tensor.__dlpack_device__()) + + def test_cuda_non_blocking_stream_multi_gpu(self): + # Test that when `pb_utils.Tensor.from_dlpack` is called on different + # GPU from where external tensor is stored, we receive a pointer + # and all pending work on different GPU's non-blocking stream + # on external tensor is done. + # This test seems to be affected by `test_cuda_multi_gpu` + # and `test_cuda_blocking_stream_multi_gpu` if GPUs 0 and 1 are used. 
+ # Thus for this test, we use GPUs 0 and 2 + # JIRA: DLIS-4887 + size = 5000 + # DLDeviceType::kDLCUDA, device_id 1 + expected_dlpack_device = (2, 2) + with cp.cuda.Device(2): + expected_output = cp.array([2, 2, 2, 2]) + non_blocking_stream = cp.cuda.Stream(non_blocking=True) + with non_blocking_stream: + cupy_tensor = cp.array([0, 0, 0, 0]) + matrix_a = cp.random.rand(size, size) + res = cp.matmul(matrix_a, matrix_a) + for _ in range(1000): + res = cp.matmul(res, matrix_a) + cupy_tensor += cp.array([2, 2, 2, 2]) + with cp.cuda.Device(0): + pb_tensor = pb_utils.Tensor.from_dlpack("tensor", cupy_tensor) + with cp.cuda.Device(2): + # To make sure that non_blocking stream is done with + # all compute work + self.assertTrue(non_blocking_stream.done) + cupy_tensor_dlpack = cp.from_dlpack(pb_tensor) + + with cp.cuda.Device(2): + self.assertTrue(cp.array_equal(cupy_tensor_dlpack, expected_output)) + + self.assertFalse(pb_tensor.is_cpu()) + self.assertEqual(pb_tensor.__dlpack_device__(), expected_dlpack_device) + self.assertEqual(pb_tensor.__dlpack_device__(), cupy_tensor.__dlpack_device__()) + + +class TritonPythonModel: def execute(self, requests): responses = [] for _ in requests: # Run the unittest and store the results in InferenceResponse. - test = unittest.main('model', exit=False) + test = unittest.main("model", exit=False) responses.append( - pb_utils.InferenceResponse([ - pb_utils.Tensor( - 'OUTPUT0', - np.array([test.result.wasSuccessful()], - dtype=np.float16)) - ])) + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) return responses diff --git a/qa/python_models/error_code/config.pbtxt b/qa/python_models/error_code/config.pbtxt new file mode 100644 index 0000000000..90fd5eb1e3 --- /dev/null +++ b/qa/python_models/error_code/config.pbtxt @@ -0,0 +1,47 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
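The stream and multi-GPU tests above repeatedly assert on __dlpack_device__(), which returns a (device_type, device_id) pair; in the DLPack enum kDLCPU is 1 and kDLCUDA is 2, which is why a tensor on GPU 1 is expected to report (2, 1). A tiny illustration, assuming a recent PyTorch and at least two visible GPUs:

import torch

cpu_tensor = torch.zeros(4)                   # expected (1, 0): kDLCPU, device 0
gpu_tensor = torch.zeros(4, device="cuda:1")  # expected (2, 1): kDLCUDA, device 1
print(cpu_tensor.__dlpack_device__(), gpu_tensor.__dlpack_device__())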
+ +name: "error_code" +backend: "python" +max_batch_size: 4 + +input [ + { + name: "ERROR_CODE" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +output [ + { + name: "DUMMY_OUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/error_code/model.py b/qa/python_models/error_code/model.py new file mode 100644 index 0000000000..078a4afb73 --- /dev/null +++ b/qa/python_models/error_code/model.py @@ -0,0 +1,59 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def execute(self, requests): + error_code_map = { + "UNKNOWN": pb_utils.TritonError.UNKNOWN, + "INTERNAL": pb_utils.TritonError.INTERNAL, + "NOT_FOUND": pb_utils.TritonError.NOT_FOUND, + "INVALID_ARG": pb_utils.TritonError.INVALID_ARG, + "UNAVAILABLE": pb_utils.TritonError.UNAVAILABLE, + "UNSUPPORTED": pb_utils.TritonError.UNSUPPORTED, + "ALREADY_EXISTS": pb_utils.TritonError.ALREADY_EXISTS, + "CANCELLED": pb_utils.TritonError.CANCELLED, + } + + responses = [] + + for request in requests: + err_code_tensor = pb_utils.get_input_tensor_by_name( + request, "ERROR_CODE" + ).as_numpy() + err_code_str = str(err_code_tensor[0][0], encoding="utf-8") + if err_code_str in error_code_map: + error = pb_utils.TritonError( + message=("error code: " + err_code_str), + code=error_code_map[err_code_str], + ) + else: + error = pb_utils.TritonError("unrecognized error code: " + err_code_str) + responses.append(pb_utils.InferenceResponse(error=error)) + + return responses diff --git a/qa/python_models/execute_cancel/config.pbtxt b/qa/python_models/execute_cancel/config.pbtxt new file mode 100644 index 0000000000..df509863ad --- /dev/null +++ b/qa/python_models/execute_cancel/config.pbtxt @@ -0,0 +1,47 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "execute_cancel" +backend: "python" +max_batch_size: 1 + +input [ + { + name: "EXECUTE_DELAY" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +output [ + { + name: "DUMMY_OUT" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/execute_cancel/model.py b/qa/python_models/execute_cancel/model.py new file mode 100644 index 0000000000..ec7b96ec1a --- /dev/null +++ b/qa/python_models/execute_cancel/model.py @@ -0,0 +1,108 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
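For the error_code model added above, a hypothetical client-side sketch of how a test could request a specific error code; the server URL, model version layout, and the exact exception text are assumptions, and TYPE_STRING inputs are sent as BYTES tensors with a numpy object array:

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException

client = httpclient.InferenceServerClient(url="localhost:8000")
# max_batch_size is 4 and dims are [1], so the request shape is [batch, 1].
inp = httpclient.InferInput("ERROR_CODE", [1, 1], "BYTES")
inp.set_data_from_numpy(np.array([["INVALID_ARG"]], dtype=np.object_))
try:
    client.infer("error_code", inputs=[inp])
except InferenceServerException as e:
    print(e)  # expected to mention "error code: INVALID_ARG"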
+ +import json +import threading +import time + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + self._logger = pb_utils.Logger + self._model_config = json.loads(args["model_config"]) + self._using_decoupled = pb_utils.using_decoupled_model_transaction_policy( + self._model_config + ) + + def execute(self, requests): + processed_requests = [] + for request in requests: + delay_tensor = pb_utils.get_input_tensor_by_name( + request, "EXECUTE_DELAY" + ).as_numpy() + delay = delay_tensor[0][0] # seconds + if self._using_decoupled: + processed_requests.append( + {"response_sender": request.get_response_sender(), "delay": delay} + ) + else: + processed_requests.append({"request": request, "delay": delay}) + if self._using_decoupled: + return self._execute_decoupled(processed_requests) + return self._execute_processed_requests(processed_requests) + + def _execute_processed_requests(self, processed_requests): + responses = [] + for processed_request in processed_requests: + error = pb_utils.TritonError(message="not cancelled") + object_to_check_cancelled = None + if "response_sender" in processed_request: + object_to_check_cancelled = processed_request["response_sender"] + elif "request" in processed_request: + object_to_check_cancelled = processed_request["request"] + delay = processed_request["delay"] # seconds + time_elapsed = 0.0 # seconds + while time_elapsed < delay: + time.sleep(1) + time_elapsed += 1.0 + if object_to_check_cancelled.is_cancelled(): + self._logger.log_info( + "[execute_cancel] Request cancelled at " + + str(time_elapsed) + + " s" + ) + error = pb_utils.TritonError( + message="cancelled", code=pb_utils.TritonError.CANCELLED + ) + break + self._logger.log_info( + "[execute_cancel] Request not cancelled at " + + str(time_elapsed) + + " s" + ) + responses.append(pb_utils.InferenceResponse(error=error)) + return responses + + def _execute_decoupled(self, processed_requests): + def response_thread(execute_processed_requests, processed_requests): + time.sleep(2) # execute after requests are released + responses = execute_processed_requests(processed_requests) + for i in range(len(responses)): # len(responses) == len(processed_requests) + response_sender = processed_requests[i]["response_sender"] + response_sender.send(responses[i]) + response_sender.send( + flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) + + thread = threading.Thread( + target=response_thread, + args=(self._execute_processed_requests, processed_requests), + ) + thread.daemon = True + thread.start() + return None diff --git a/qa/python_models/execute_error/model.py b/qa/python_models/execute_error/model.py index 2a244e083e..9ecdbff816 100644 --- a/qa/python_models/execute_error/model.py +++ b/qa/python_models/execute_error/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,24 +28,23 @@ class TritonPythonModel: - def execute(self, requests): - """ This function is called on inference request. 
- """ + """This function is called on inference request.""" responses = [] - # Only generate the error for the first request + # Generate the error for the first and third request i = 0 for request in requests: input_tensor = pb_utils.get_input_tensor_by_name(request, "IN") out_tensor = pb_utils.Tensor("OUT", input_tensor.as_numpy()) if i == 0: - error = pb_utils.TritonError( - 'An error occured during execution') - responses.append(pb_utils.InferenceResponse([out_tensor], - error)) - else: + error = pb_utils.TritonError("An error occurred during execution") + responses.append(pb_utils.InferenceResponse([out_tensor], error)) + elif i == 1: responses.append(pb_utils.InferenceResponse([out_tensor])) + elif i == 2: + error = pb_utils.TritonError("An error occurred during execution") + responses.append(pb_utils.InferenceResponse(error=error)) i += 1 return responses diff --git a/qa/python_models/execute_return_error/model.py b/qa/python_models/execute_return_error/model.py index 6e19d68e4a..e304441f04 100644 --- a/qa/python_models/execute_return_error/model.py +++ b/qa/python_models/execute_return_error/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,11 +24,8 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - def initialize(self, args): self._i = -1 diff --git a/qa/python_models/fini_error/model.py b/qa/python_models/fini_error/model.py index 3f8c1ab5f3..7a9f409aee 100644 --- a/qa/python_models/fini_error/model.py +++ b/qa/python_models/fini_error/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,7 +28,6 @@ class TritonPythonModel: - def execute(self, requests): """ The body of this model doesn't matter. The main purpose of this model is diff --git a/qa/python_models/ground_truth/config.pbtxt b/qa/python_models/ground_truth/config.pbtxt new file mode 100644 index 0000000000..2b7a7d19a2 --- /dev/null +++ b/qa/python_models/ground_truth/config.pbtxt @@ -0,0 +1,52 @@ +# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "ground_truth" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] diff --git a/qa/python_models/ground_truth/model.py b/qa/python_models/ground_truth/model.py new file mode 100644 index 0000000000..24a286e300 --- /dev/null +++ b/qa/python_models/ground_truth/model.py @@ -0,0 +1,51 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
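The execute_error changes above now exercise two ways a response can carry an error: alongside output tensors (first request) and as an error-only response with no tensors (third request). A minimal sketch of both forms, assuming the backend context where pb_utils is importable:

import numpy as np
import triton_python_backend_utils as pb_utils  # importable only inside the Python backend

error = pb_utils.TritonError("An error occurred during execution")
out_tensor = pb_utils.Tensor("OUT", np.zeros(1, dtype=np.float32))

with_output = pb_utils.InferenceResponse([out_tensor], error)  # tensors plus an error
error_only = pb_utils.InferenceResponse(error=error)           # error with no output tensors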
+ +import time + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def execute(self, requests): + """ + Mock Model that uses the input data to determine how long to wait + before returning identity data + """ + assert len(requests) == 1 + delay = 0 + request = requests[0] + responses = [] + + delay_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") + delay_as_numpy = delay_tensor.as_numpy() + delay = float(delay_as_numpy[0][0]) + + out_tensor = pb_utils.Tensor("OUTPUT0", delay_as_numpy) + responses.append(pb_utils.InferenceResponse([out_tensor])) + + time.sleep(delay) + return responses diff --git a/qa/python_models/identity_fp32/model.py b/qa/python_models/identity_fp32/model.py index 4273977263..2161a1e732 100644 --- a/qa/python_models/identity_fp32/model.py +++ b/qa/python_models/identity_fp32/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,7 +28,6 @@ class TritonPythonModel: - def execute(self, requests): """ Identity model in Python backend. diff --git a/qa/python_models/identity_fp32_logging/config.pbtxt b/qa/python_models/identity_fp32_logging/config.pbtxt new file mode 100644 index 0000000000..aaa4a2ee43 --- /dev/null +++ b/qa/python_models/identity_fp32_logging/config.pbtxt @@ -0,0 +1,53 @@ +# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
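The ground_truth model above reads its sleep duration from INPUT0, echoes it back as OUTPUT0, and only then sleeps, so the measured latency includes the delay. Because max_batch_size is 64, the input arrives shaped [batch, 1], which is why the model indexes [0][0]. A small illustration of the expected input layout:

import numpy as np

delay_input = np.array([[0.25]], dtype=np.float32)  # shape (1, 1): [batch, dims]
delay = float(delay_input[0][0])                    # 0.25 seconds, as indexed in the model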
+ +name: "identity_fp32_logging" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] + diff --git a/qa/python_models/identity_fp32_logging/model.py b/qa/python_models/identity_fp32_logging/model.py new file mode 100644 index 0000000000..91ace61fd5 --- /dev/null +++ b/qa/python_models/identity_fp32_logging/model.py @@ -0,0 +1,72 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + logger = pb_utils.Logger + logger.log("Initialize-Specific Msg!", logger.INFO) + logger.log_info("Initialize-Info Msg!") + logger.log_warn("Initialize-Warning Msg!") + logger.log_error("Initialize-Error Msg!") + logger.log_verbose("Initialize-Verbose Msg!") + + def execute(self, requests): + """ + Identity model in Python backend. 
+ """ + # Log as early as possible + logger = pb_utils.Logger + logger.log("Execute-Specific Msg!", logger.INFO) + logger.log_info("Execute-Info Msg!") + logger.log_warn("Execute-Warning Msg!") + logger.log_error("Execute-Error Msg!") + logger.log_verbose("Execute-Verbose Msg!") + + responses = [] + for request in requests: + input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") + out_tensor = pb_utils.Tensor("OUTPUT0", input_tensor.as_numpy()) + responses.append(pb_utils.InferenceResponse([out_tensor])) + + # Log as late as possible + logger.log("Execute-Specific Msg!", logger.INFO) + logger.log_info("Execute-Info Msg!") + logger.log_warn("Execute-Warning Msg!") + logger.log_error("Execute-Error Msg!") + logger.log_verbose("Execute-Verbose Msg!") + + return responses + + def finalize(self): + logger = pb_utils.Logger + logger.log("Finalize-Specific Msg!", logger.INFO) + logger.log_info("Finalize-Info Msg!") + logger.log_warn("Finalize-Warning Msg!") + logger.log_error("Finalize-Error Msg!") + logger.log_verbose("Finalize-Verbose Msg!") diff --git a/qa/python_models/identity_fp32_timeout/config.pbtxt b/qa/python_models/identity_fp32_timeout/config.pbtxt new file mode 100644 index 0000000000..c14fd8e0a3 --- /dev/null +++ b/qa/python_models/identity_fp32_timeout/config.pbtxt @@ -0,0 +1,60 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +name: "identity_fp32_timeout" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] + +dynamic_batching { + default_queue_policy { + timeout_action: REJECT + allow_timeout_override: true + default_timeout_microseconds: 1000000 + } +} diff --git a/qa/python_models/identity_fp32_timeout/model.py b/qa/python_models/identity_fp32_timeout/model.py new file mode 100644 index 0000000000..356948e8de --- /dev/null +++ b/qa/python_models/identity_fp32_timeout/model.py @@ -0,0 +1,45 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import time + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def execute(self, requests): + """ + Identity model in Python backend. + """ + logger = pb_utils.Logger + responses = [] + for request in requests: + input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") + out_tensor = pb_utils.Tensor("OUTPUT0", input_tensor.as_numpy()) + logger.log_info(f"Request timeout: {request.timeout()}") + time.sleep(5) + responses.append(pb_utils.InferenceResponse([out_tensor])) + return responses diff --git a/qa/python_models/init_args/model.py b/qa/python_models/init_args/model.py index 2f3d933b79..12dd2212a1 100644 --- a/qa/python_models/init_args/model.py +++ b/qa/python_models/init_args/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,20 +24,39 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+import os + import numpy as np import triton_python_backend_utils as pb_utils -class TritonPythonModel: +def check_init_args(args): + expected_args = { + "model_name": "init_args", + "model_instance_name": "init_args_0_0", + "model_instance_kind": "CPU", + "model_instance_device_id": "0", + "model_repository": os.getenv("TRITON_DIR", "/opt/tritonserver") + + "/qa/L0_backend_python/models/init_args", + "model_version": "1", + } - def initialize(self, args): - self.args = args - if args['model_name'] != 'init_args' or args[ - 'model_instance_name'] != 'init_args_0': + for arg in expected_args: + if args[arg] != expected_args[arg]: raise pb_utils.TritonModelException( - 'model_instance_name/model_name does not contain correct value.' + arg + + ' does not contain correct value. Expected "' + + expected_args[arg] + + ", got " + + args[arg] ) + +class TritonPythonModel: + def initialize(self, args): + self.args = args + check_init_args(self.args) + def execute(self, requests): """ This function counts the number of keys in the @@ -45,9 +64,13 @@ def execute(self, requests): correct. """ keys = [ - 'model_config', 'model_instance_kind', 'model_instance_name', - 'model_instance_device_id', 'model_repository', 'model_version', - 'model_name' + "model_config", + "model_instance_kind", + "model_instance_name", + "model_instance_device_id", + "model_repository", + "model_version", + "model_name", ] correct_keys = 0 @@ -58,6 +81,7 @@ def execute(self, requests): responses = [] for _ in requests: out_args = pb_utils.Tensor( - "OUT", np.array([correct_keys], dtype=np.float32)) + "OUT", np.array([correct_keys], dtype=np.float32) + ) responses.append(pb_utils.InferenceResponse([out_args])) return responses diff --git a/qa/python_models/init_error/model.py b/qa/python_models/init_error/model.py index 11c6a6fb07..654dc8ef2c 100644 --- a/qa/python_models/init_error/model.py +++ b/qa/python_models/init_error/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,9 +28,8 @@ class TritonPythonModel: - def initialize(self, args): - self.model_config = args['model_config'] + self.model_config = args["model_config"] lorem_ipsum def execute(self, requests): diff --git a/qa/python_models/init_exit/config.pbtxt b/qa/python_models/init_exit/config.pbtxt new file mode 100644 index 0000000000..a18aff189d --- /dev/null +++ b/qa/python_models/init_exit/config.pbtxt @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "init_exit" +backend: "python" + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/init_exit/model.py b/qa/python_models/init_exit/model.py new file mode 100644 index 0000000000..e0fc8b55a4 --- /dev/null +++ b/qa/python_models/init_exit/model.py @@ -0,0 +1,40 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import os +import signal +import time + + +class TritonPythonModel: + def initialize(self, args): + time.sleep(3) + # Simulate the case that the model goes out of memory and gets killed + # by the OOM killer + os.kill(os.getpid(), signal.SIGKILL) + + def execute(self, requests): + pass diff --git a/qa/python_models/iterative_sequence/config.pbtxt b/qa/python_models/iterative_sequence/config.pbtxt new file mode 100644 index 0000000000..faa1735718 --- /dev/null +++ b/qa/python_models/iterative_sequence/config.pbtxt @@ -0,0 +1,51 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "iterative_sequence" +backend: "python" +max_batch_size: 0 +model_transaction_policy { + decoupled: True +} +input [ + { + name: "IN" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] +output [ + { + name: "OUT" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] +sequence_batching { + iterative_sequence : true +} + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/iterative_sequence/model.py b/qa/python_models/iterative_sequence/model.py new file mode 100644 index 0000000000..c45f82a607 --- /dev/null +++ b/qa/python_models/iterative_sequence/model.py @@ -0,0 +1,131 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + """ + This model takes 1 input tensor, an INT32 [ 1 ] input named "IN", and + produces an output tensor "OUT" with the same shape as the input tensor. + The input value indicates the total number of responses to be generated and + the output value indicates the number of remaining responses. For example, + if the request input has value 2, the model will: + - Send a response with value 1. + - Release request with RESCHEDULE flag. + - When execute on the same request, send the last response with value 0. + - Release request with ALL flag. + """ + + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + using_decoupled = pb_utils.using_decoupled_model_transaction_policy( + model_config + ) + if not using_decoupled: + raise pb_utils.TritonModelException( + """the model `{}` can generate any number of responses per request, + enable decoupled transaction policy in model configuration to + serve this model""".format( + args["model_name"] + ) + ) + + # Get IN configuration + in_config = pb_utils.get_input_config_by_name(model_config, "IN") + + # Validate the shape and data type of IN + in_shape = in_config["dims"] + if (len(in_shape) != 1) or (in_shape[0] != 1): + raise pb_utils.TritonModelException( + """the model `{}` requires the shape of 'IN' to be + [1], got {}""".format( + args["model_name"], in_shape + ) + ) + if in_config["data_type"] != "TYPE_INT32": + raise pb_utils.TritonModelException( + """the model `{}` requires the data_type of 'IN' to be + 'TYPE_INT32', got {}""".format( + args["model_name"], in_config["data_type"] + ) + ) + + # Get OUT configuration + out_config = pb_utils.get_output_config_by_name(model_config, "OUT") + + # Validate the shape and data type of OUT + out_shape = out_config["dims"] + if (len(out_shape) != 1) or (out_shape[0] != 1): + raise pb_utils.TritonModelException( + """the model `{}` requires the shape of 'OUT' to be + [1], got {}""".format( + args["model_name"], out_shape + ) + ) + if out_config["data_type"] != "TYPE_INT32": + raise pb_utils.TritonModelException( + """the model `{}` requires the data_type of 'OUT' to be + 'TYPE_INT32', got {}""".format( + args["model_name"], out_config["data_type"] + ) + ) + + self.remaining_response = 0 + self.reset_flag = True + + def execute(self, requests): + for request in requests: + in_input = pb_utils.get_input_tensor_by_name(request, "IN").as_numpy() + + if self.reset_flag: + self.remaining_response = in_input[0] + self.reset_flag = False + + response_sender = request.get_response_sender() + + self.remaining_response -= 1 + + out_output = pb_utils.Tensor( + "OUT", np.array([self.remaining_response], np.int32) + ) + response = pb_utils.InferenceResponse(output_tensors=[out_output]) + + if self.remaining_response <= 0: + response_sender.send( + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) + else: + 
request.set_release_flags( + pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE + ) + response_sender.send(response) + + return None diff --git a/qa/python_models/model_env/model.py b/qa/python_models/model_env/model.py index 0eff470394..8cc9db8d81 100644 --- a/qa/python_models/model_env/model.py +++ b/qa/python_models/model_env/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,17 +25,18 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import os + import triton_python_backend_utils as pb_utils class TritonPythonModel: - def initialize(self, args): # Make sure that environment variables are correctly propagated # to the Python models - if "MY_ENV" not in os.environ or os.environ["MY_ENV"] != 'MY_ENV': + if "MY_ENV" not in os.environ or os.environ["MY_ENV"] != "MY_ENV": raise pb_utils.TritonModelException( - "MY_ENV doesn't exists or contains incorrect value") + "MY_ENV doesn't exists or contains incorrect value" + ) def execute(self, requests): pass diff --git a/qa/python_models/model_init_del/config.pbtxt b/qa/python_models/model_init_del/config.pbtxt new file mode 100644 index 0000000000..be66468a0a --- /dev/null +++ b/qa/python_models/model_init_del/config.pbtxt @@ -0,0 +1,52 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
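A worked trace of the iterative_sequence model above for an input value of 3, following its docstring: the model counts down, releasing the request with the RESCHEDULE flag after each intermediate response and sending the final response with the FINAL flag before the request is released with ALL:

remaining = 3
emitted = []
while True:
    remaining -= 1
    emitted.append(remaining)   # each value is sent as one OUT response
    if remaining <= 0:
        break                   # this last response carries TRITONSERVER_RESPONSE_COMPLETE_FINAL
print(emitted)  # [2, 1, 0]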
+ +name: "model_init_del" +backend: "python" +max_batch_size: 0 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] # end instance_group diff --git a/qa/python_models/model_init_del/model.py b/qa/python_models/model_init_del/model.py new file mode 100644 index 0000000000..578279f8ef --- /dev/null +++ b/qa/python_models/model_init_del/model.py @@ -0,0 +1,57 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import os +import sys +import time + +import triton_python_backend_utils as pb_utils + +sys.path.append(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))) +from util import get_delay, inc_count + + +class TritonPythonModel: + def initialize(self, args): + inc_count("initialize") + self._sleep("initialize") + + def execute(self, requests): + responses = [] + for request in requests: + input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") + out_tensor = pb_utils.Tensor("OUTPUT0", input_tensor.as_numpy()) + responses.append(pb_utils.InferenceResponse([out_tensor])) + self._sleep("infer") + return responses + + def finalize(self): + inc_count("finalize") + + def _sleep(self, kind): + delay = get_delay(kind) + if delay > 0: + time.sleep(delay) diff --git a/qa/python_models/model_init_del/util.py b/qa/python_models/model_init_del/util.py new file mode 100755 index 0000000000..a36f13eea9 --- /dev/null +++ b/qa/python_models/model_init_del/util.py @@ -0,0 +1,189 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import fcntl +import os + +_model_name = "model_init_del" + +# +# Helper functions for reading/writing state to disk +# + + +def _get_number(filename): + full_path = os.path.join(os.environ["MODEL_LOG_DIR"], filename) + try: + with open(full_path, mode="r", encoding="utf-8", errors="strict") as f: + fcntl.lockf(f, fcntl.LOCK_SH) + txt = f.read() + except FileNotFoundError: + txt = "0" + return int(txt) + + +def _store_number(filename, number): + full_path = os.path.join(os.environ["MODEL_LOG_DIR"], filename) + txt = str(number) + with open(full_path, mode="w", encoding="utf-8", errors="strict") as f: + fcntl.lockf(f, fcntl.LOCK_EX) + f.write(txt) + + +def _inc_number(filename): + full_path = os.path.join(os.environ["MODEL_LOG_DIR"], filename) + try: + with open(full_path, mode="r+", encoding="utf-8", errors="strict") as f: + fcntl.lockf(f, fcntl.LOCK_EX) + txt = f.read() + number = int(txt) + 1 + txt = str(number) + f.truncate(0) + f.seek(0) + f.write(txt) + except FileNotFoundError: + number = 1 + _store_number(filename, number) + return number + + +# +# Functions for communicating initialize and finalize count between the model +# and test +# + + +def _get_count_filename(kind): + if kind != "initialize" and kind != "finalize": + raise KeyError("Invalid count kind: " + str(kind)) + filename = _model_name + "_" + kind + "_count.txt" + return filename + + +def get_count(kind): + return _get_number(_get_count_filename(kind)) + + +def inc_count(kind): + return _inc_number(_get_count_filename(kind)) + + +def reset_count(kind): + count = 0 + _store_number(_get_count_filename(kind), count) + return count + + +# +# Functions for communicating varies of delay (in seconds) to the model +# + + +def _get_delay_filename(kind): + if kind != "initialize" and kind != "infer": + raise KeyError("Invalid delay kind: " + str(kind)) + filename = _model_name + "_" + kind + "_delay.txt" + return filename + + +def get_delay(kind): + return _get_number(_get_delay_filename(kind)) + + +def set_delay(kind, delay): + _store_number(_get_delay_filename(kind), delay) + return delay + + +# +# Functions for modifying the model +# + + +def update_instance_group(instance_group_str): + full_path = os.path.join(os.path.dirname(__file__), "config.pbtxt") + with open(full_path, mode="r+", encoding="utf-8", errors="strict") as f: + txt = f.read() + txt, 
post_match = txt.split("instance_group [") + txt += "instance_group [\n" + txt += instance_group_str + txt += "\n] # end instance_group\n" + txt += post_match.split("\n] # end instance_group\n")[1] + f.truncate(0) + f.seek(0) + f.write(txt) + return txt + + +def update_sequence_batching(sequence_batching_str): + full_path = os.path.join(os.path.dirname(__file__), "config.pbtxt") + with open(full_path, mode="r+", encoding="utf-8", errors="strict") as f: + txt = f.read() + if "sequence_batching {" in txt: + txt, post_match = txt.split("sequence_batching {") + if sequence_batching_str != "": + txt += "sequence_batching {\n" + txt += sequence_batching_str + txt += "\n} # end sequence_batching\n" + txt += post_match.split("\n} # end sequence_batching\n")[1] + elif sequence_batching_str != "": + txt += "\nsequence_batching {\n" + txt += sequence_batching_str + txt += "\n} # end sequence_batching\n" + f.truncate(0) + f.seek(0) + f.write(txt) + return txt + + +def update_model_file(): + full_path = os.path.join(os.path.dirname(__file__), "1", "model.py") + with open(full_path, mode="a", encoding="utf-8", errors="strict") as f: + f.write("\n# dummy model file update\n") + + +def enable_batching(): + full_path = os.path.join(os.path.dirname(__file__), "config.pbtxt") + with open(full_path, mode="r+", encoding="utf-8", errors="strict") as f: + txt = f.read() + txt = txt.replace("max_batch_size: 0", "max_batch_size: 2") + f.truncate(0) + f.seek(0) + f.write(txt) + return txt + + +def disable_batching(): + full_path = os.path.join(os.path.dirname(__file__), "config.pbtxt") + with open(full_path, mode="r+", encoding="utf-8", errors="strict") as f: + txt = f.read() + txt = txt.replace("max_batch_size: 2", "max_batch_size: 0") + f.truncate(0) + f.seek(0) + f.write(txt) + return txt diff --git a/qa/python_models/multi_file/file1.py b/qa/python_models/multi_file/file1.py old mode 100644 new mode 100755 index 3e6706ade9..46b6d76934 --- a/qa/python_models/multi_file/file1.py +++ b/qa/python_models/multi_file/file1.py @@ -1,4 +1,6 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,4 +26,4 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -FILE_NAME = 'FILE1' +FILE_NAME = "FILE1" diff --git a/qa/python_models/multi_file/file2.py b/qa/python_models/multi_file/file2.py old mode 100644 new mode 100755 index 2b73ab0e3d..b7174da748 --- a/qa/python_models/multi_file/file2.py +++ b/qa/python_models/multi_file/file2.py @@ -1,4 +1,6 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,4 +26,4 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
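A hypothetical test-side usage of the model_init_del util.py helpers above; the import path is illustrative (the module must be on sys.path), and MODEL_LOG_DIR must point at a directory shared with the model instances:

import os
os.environ["MODEL_LOG_DIR"] = "/tmp"  # assumed location visible to the model processes

import util  # hypothetical: qa/python_models/model_init_del/util.py placed on sys.path

util.reset_count("initialize")       # start the test from a clean counter
util.set_delay("infer", 2)           # make each execute() sleep 2 seconds
print(util.get_count("initialize"))  # 0 until a model instance has initialized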
-FILE_NAME = 'FILE2' +FILE_NAME = "FILE2" diff --git a/qa/python_models/multi_file/model.py b/qa/python_models/multi_file/model.py index a5a55002aa..b94d6f336f 100644 --- a/qa/python_models/multi_file/model.py +++ b/qa/python_models/multi_file/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,16 +25,15 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import file1 -from . import file2 - import triton_python_backend_utils as pb_utils +from . import file2 -class TritonPythonModel: +class TritonPythonModel: def initialize(self, args): - if file1.FILE_NAME != 'FILE1' or file2.FILE_NAME != 'FILE2': - raise pb_utils.TritonModelException('Imports do not work') + if file1.FILE_NAME != "FILE1" or file2.FILE_NAME != "FILE2": + raise pb_utils.TritonModelException("Imports do not work") def execute(self, requests): pass diff --git a/qa/python_models/non_contiguous/model.py b/qa/python_models/non_contiguous/model.py index c8cb4b5570..de7417303b 100644 --- a/qa/python_models/non_contiguous/model.py +++ b/qa/python_models/non_contiguous/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -29,7 +29,6 @@ class TritonPythonModel: - def execute(self, requests): responses = [] new_shape = [10, 2, 6, 5, 11] @@ -40,8 +39,8 @@ def execute(self, requests): output0 = pb_utils.Tensor("OUTPUT0", input_numpy.reshape(new_shape)) # Transpose the tensor to create a non-contiguous tensor. output1 = pb_utils.Tensor("OUTPUT1", input_numpy.T) - output2 = pb_utils.Tensor("OUTPUT2", - np.transpose(input_numpy, shape_reorder)) - responses.append( - pb_utils.InferenceResponse([output0, output1, output2])) + output2 = pb_utils.Tensor( + "OUTPUT2", np.transpose(input_numpy, shape_reorder) + ) + responses.append(pb_utils.InferenceResponse([output0, output1, output2])) return responses diff --git a/qa/python_models/optional/config.pbtxt b/qa/python_models/optional/config.pbtxt index a496e48291..c681ec807f 100644 --- a/qa/python_models/optional/config.pbtxt +++ b/qa/python_models/optional/config.pbtxt @@ -53,10 +53,3 @@ output [ dims: [ 1 ] } ] - -instance_group [ - { - count: 1 - kind : KIND_CPU - } -] diff --git a/qa/python_models/optional/model.py b/qa/python_models/optional/model.py index 8e22d3b492..f0a790b43a 100644 --- a/qa/python_models/optional/model.py +++ b/qa/python_models/optional/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,12 +24,11 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils import numpy as np +import triton_python_backend_utils as pb_utils class TritonPythonModel: - def execute(self, requests): """Model supporting optional inputs. 
If the input is not provided, an input tensor of size 1 containing scalar 5 will be used.""" @@ -48,11 +47,10 @@ def execute(self, requests): else: input1_numpy = np.array([5], dtype=np.int32) - output0_tensor = pb_utils.Tensor("OUTPUT0", - input0_numpy + input1_numpy) - output1_tensor = pb_utils.Tensor("OUTPUT1", - input0_numpy - input1_numpy) + output0_tensor = pb_utils.Tensor("OUTPUT0", input0_numpy + input1_numpy) + output1_tensor = pb_utils.Tensor("OUTPUT1", input0_numpy - input1_numpy) responses.append( - pb_utils.InferenceResponse([output0_tensor, output1_tensor])) + pb_utils.InferenceResponse([output0_tensor, output1_tensor]) + ) return responses diff --git a/qa/python_models/python_based_backends/add_sub_backend/model.py b/qa/python_models/python_based_backends/add_sub_backend/model.py new file mode 100644 index 0000000000..7c9736b2d5 --- /dev/null +++ b/qa/python_models/python_based_backends/add_sub_backend/model.py @@ -0,0 +1,162 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json +import os + +import triton_python_backend_utils as pb_utils + +_ADD_SUB_ARGS_FILENAME = "model.json" + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + """This function is called only once when loading the model assuming + the server was not started with `--disable-auto-complete-config`. + + Parameters + ---------- + auto_complete_model_config : pb_utils.ModelConfig + An object containing the existing model configuration. 
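A condensed sketch of the optional-input handling in the model above, assuming pb_utils.get_input_tensor_by_name() returns None when an optional input is omitted from the request (backend context only; `request` is one element of the `requests` list passed to execute()):

import numpy as np
import triton_python_backend_utils as pb_utils  # importable only inside the Python backend

in1 = pb_utils.get_input_tensor_by_name(request, "INPUT1")
input1_numpy = in1.as_numpy() if in1 is not None else np.array([5], dtype=np.int32)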
+ + Returns + ------- + pb_utils.ModelConfig + An object containing the auto-completed model configuration + """ + inputs = [ + {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]}, + {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]}, + ] + outputs = [{"name": "OUTPUT", "data_type": "TYPE_FP32", "dims": [4]}] + + config = auto_complete_model_config.as_dict() + input_names = [] + output_names = [] + + for input in config["input"]: + input_names.append(input["name"]) + + for output in config["output"]: + output_names.append(output["name"]) + + for input in inputs: + if input["name"] not in input_names: + auto_complete_model_config.add_input(input) + + for output in outputs: + if output["name"] not in output_names: + auto_complete_model_config.add_output(output) + + return auto_complete_model_config + + def initialize(self, args): + """This function allows the model to initialize any state associated with this model. + + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + + self.model_config = model_config = json.loads(args["model_config"]) + + # Get OUTPUT configuration + output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT") + + engine_args_filepath = os.path.join( + pb_utils.get_model_dir(), _ADD_SUB_ARGS_FILENAME + ) + assert os.path.isfile( + engine_args_filepath + ), f"'{_ADD_SUB_ARGS_FILENAME}' containing add sub model args must be provided in '{pb_utils.get_model_dir()}'" + + with open(engine_args_filepath) as file: + self.add_sub_config = json.load(file) + + assert ( + "operation" in self.add_sub_config + ), f"Missing required key 'operation' in {_ADD_SUB_ARGS_FILENAME}" + + extra_keys = set(self.add_sub_config.keys()) - {"operation"} + assert ( + not extra_keys + ), f"Unsupported keys are provided in {_ADD_SUB_ARGS_FILENAME}: {', '.join(extra_keys)}" + + assert self.add_sub_config["operation"] in [ + "add", + "sub", + ], f"'operation' value must be 'add' or 'sub' in {_ADD_SUB_ARGS_FILENAME}" + + # Convert Triton types to numpy types + self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + + def execute(self, requests): + """This function is called when an inference request is made + for this model. + + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + + responses = [] + + for request in requests: + in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") + in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") + + if self.add_sub_config["operation"] == "add": + out = in_0.as_numpy() + in_1.as_numpy() + else: + out = in_0.as_numpy() - in_1.as_numpy() + + # Create output tensors. + out_tensor = pb_utils.Tensor("OUTPUT", out.astype(self.output_dtype)) + + # Create InferenceResponse. 
+ inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor]) + responses.append(inference_response) + + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded.""" + print("Cleaning up...") diff --git a/qa/python_models/python_version/model.py b/qa/python_models/python_version/model.py index ee358ffc55..5d77906fa9 100644 --- a/qa/python_models/python_version/model.py +++ b/qa/python_models/python_version/model.py @@ -1,4 +1,4 @@ -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,18 +24,19 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import sys +import locale import os +import sys + +import numpy as np import triton_python_backend_utils as pb_utils class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input = {'name': 'INPUT', 'data_type': 'TYPE_FP32', 'dims': [1]} - output = {'name': 'OUTPUT', 'data_type': 'TYPE_FP32', 'dims': [1]} + input = {"name": "INPUT", "data_type": "TYPE_FP32", "dims": [1]} + output = {"name": "OUTPUT", "data_type": "TYPE_FP32", "dims": [1]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input) @@ -45,19 +46,21 @@ def auto_complete_config(auto_complete_model_config): def initialize(self, args): import tensorflow - self.model_config = args['model_config'] + + self.model_config = args["model_config"] # This is to make sure that /bin/bash is not picking up # the wrong shared libraries after installing Tensorflow. # Tensorflow uses a shared library which is common with # bash. - os.system('/bin/bash --help') + os.system("/bin/bash --help") print( - f'Python version is {sys.version_info.major}.{sys.version_info.minor}, NumPy version is {np.version.version}, and Tensorflow version is {tensorflow.__version__}', - flush=True) + f"Python version is {sys.version_info.major}.{sys.version_info.minor}, NumPy version is {np.version.version}, and Tensorflow version is {tensorflow.__version__}", + flush=True, + ) + print(f"Locale is {locale.getlocale()}", flush=True) def execute(self, requests): - """ This function is called on inference request. - """ + """This function is called on inference request.""" responses = [] for request in requests: input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") diff --git a/qa/python_models/pytorch_fp32_fp32/model.py b/qa/python_models/pytorch_fp32_fp32/model.py index 4f11d3c726..98269213b2 100644 --- a/qa/python_models/pytorch_fp32_fp32/model.py +++ b/qa/python_models/pytorch_fp32_fp32/model.py @@ -1,4 +1,4 @@ -# Copyright 2020-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,16 +25,13 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
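The add_sub python-based backend added earlier in this patch reads a small model.json from the directory returned by pb_utils.get_model_dir() and only honors a single "operation" key whose value must be "add" or "sub". A minimal sketch of producing that file is below; the repository path and the choice of the version directory are assumptions, not something this patch pins down.

    # Illustrative only: write the model.json consumed by the add_sub backend's
    # initialize() above. The single supported key is "operation" ("add" or "sub").
    import json
    import os

    model_dir = "models/add_sub_example/1"  # hypothetical model repository layout
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump({"operation": "add"}, f)

Any other key in the file trips the assertion in initialize(), so the file should contain nothing but "operation".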
import numpy as np - import torch import torch.nn as nn import torch.nn.functional as F - import triton_python_backend_utils as pb_utils class Net(nn.Module): - def __init__(self): super(Net, self).__init__() self.conv1 = nn.Conv2d(1, 32, 3, 1) @@ -61,7 +58,6 @@ def forward(self, x): class TritonPythonModel: - def initialize(self, args): torch.manual_seed(0) self.model = Net() diff --git a/qa/python_models/request_rescheduling_addsub/config.pbtxt b/qa/python_models/request_rescheduling_addsub/config.pbtxt new file mode 100644 index 0000000000..7667bfb3c0 --- /dev/null +++ b/qa/python_models/request_rescheduling_addsub/config.pbtxt @@ -0,0 +1,61 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "request_rescheduling_addsub" +backend: "python" + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] +sequence_batching { + iterative_sequence : true +} +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/request_rescheduling_addsub/model.py b/qa/python_models/request_rescheduling_addsub/model.py new file mode 100644 index 0000000000..fb7b0ac9c7 --- /dev/null +++ b/qa/python_models/request_rescheduling_addsub/model.py @@ -0,0 +1,82 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") + + self.output0_dtype = pb_utils.triton_string_to_numpy( + output0_config["data_type"] + ) + self.output1_dtype = pb_utils.triton_string_to_numpy( + output1_config["data_type"] + ) + + self.idx = 0 + + def execute(self, requests): + """This function is called on inference request.""" + + output0_dtype = self.output0_dtype + output1_dtype = self.output1_dtype + + responses = [] + + for request in requests: + in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") + in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") + + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) + + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) + + inference_response = pb_utils.InferenceResponse( + output_tensors=[out_tensor_0, out_tensor_1] + ) + + # Explicitly reschedule the first request + if self.idx == 0: + request.set_release_flags( + pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE + ) + responses.append(None) + self.idx += 1 + else: + responses.append(inference_response) + + return responses diff --git a/qa/python_models/response_sender_error/model.py b/qa/python_models/response_sender_error/model.py index eef186e9d4..4f1e0e5e85 100644 --- a/qa/python_models/response_sender_error/model.py +++ b/qa/python_models/response_sender_error/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,32 +24,32 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import json + import triton_python_backend_utils as pb_utils class TritonPythonModel: - """ This model tries to create a response sender in + """This model tries to create a response sender in a model that is not configured with decoupled model transaction policy. 
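The request_rescheduling_addsub model just above exercises request rescheduling: the first request it receives is released back to Triton with TRITONSERVER_REQUEST_RELEASE_RESCHEDULE and a None placeholder keeps the responses list aligned with requests; when the same request is delivered again, the normal response is produced. A distilled sketch of that pattern follows (a config with FP32 INPUT0/OUTPUT0 and iterative sequence batching is assumed; this is not the model above verbatim).

    # Reschedule-once pattern: the first delivery gets no response yet, only a
    # RESCHEDULE release flag and a None slot; redelivered and later requests
    # get ordinary responses.
    import numpy as np
    import triton_python_backend_utils as pb_utils

    class TritonPythonModel:
        def initialize(self, args):
            self.first_delivery = True

        def execute(self, requests):
            responses = []
            for request in requests:
                if self.first_delivery:
                    self.first_delivery = False
                    request.set_release_flags(
                        pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE
                    )
                    responses.append(None)  # rescheduled requests must not respond yet
                else:
                    in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
                    out = pb_utils.Tensor("OUTPUT0", in_0.as_numpy().astype(np.float32))
                    responses.append(pb_utils.InferenceResponse([out]))
            return responses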
""" def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) def execute(self, requests): - """ Tries to create a response sender object and use that + """Tries to create a response sender object and use that for sending the response. """ @@ -60,15 +60,16 @@ def execute(self, requests): response_sender = request.get_response_sender() in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") - out_0, out_1 = (in_0.as_numpy() + in_1.as_numpy(), - in_0.as_numpy() - in_1.as_numpy()) + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) response_sender.send( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + pb_utils.InferenceResponse([out_tensor_0, out_tensor_1]) + ) response_sender.close() return None diff --git a/qa/python_models/sequence_int32/config.pbtxt b/qa/python_models/sequence_int32/config.pbtxt new file mode 100644 index 0000000000..fb9236b347 --- /dev/null +++ b/qa/python_models/sequence_int32/config.pbtxt @@ -0,0 +1,80 @@ +# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "python_nobatch_sequence_int32" +backend: "python" +max_batch_size: 0 +version_policy: { latest { num_versions: 1 }} + + +instance_group [ + { + kind: KIND_GPU +count: 4 + } +] + + +input [ + { + name: "INPUT" + data_type: TYPE_INT32 + dims: [ 1 ] + + } +] +output [ + { + name: "OUTPUT" + data_type: TYPE_INT32 + dims: [ 1 ] + + + } +] +sequence_batching { + max_sequence_idle_microseconds: 5000000 + control_input [ + { + name: "START" + control [ + { + kind: CONTROL_SEQUENCE_START + int32_false_true: [ 0, 1 ] + } + ] + }, + { + name: "READY" + control [ + { + kind: CONTROL_SEQUENCE_READY + int32_false_true: [ 0, 1 ] + } + ] + } + ] +} diff --git a/qa/python_models/sequence_int32/model.py b/qa/python_models/sequence_int32/model.py new file mode 100644 index 0000000000..445cb5b13e --- /dev/null +++ b/qa/python_models/sequence_int32/model.py @@ -0,0 +1,92 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT") + + self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + + self.accumulator = np.zeros(1) + self.max_batch_size = model_config["max_batch_size"] + + def execute(self, requests): + """ + This function is called on inference request. 
+ It is derived from "create_tf_modelfile" in + common/gen_qa_sequence_models.py and maintains + a true accumulator when the max batch size is 0 + + """ + output_dtype = self.output_dtype + + responses = [] + for request in requests: + input_tensor = ( + pb_utils.get_input_tensor_by_name(request, "INPUT") + .as_numpy() + .astype(np.int32) + ) + start_tensor = ( + pb_utils.get_input_tensor_by_name(request, "START") + .as_numpy() + .astype(np.int32) + ) + ready_tensor = ( + pb_utils.get_input_tensor_by_name(request, "READY") + .as_numpy() + .astype(np.int32) + ) + + if self.max_batch_size == 0: + tmp = np.where( + np.equal(start_tensor, 1), + input_tensor, + np.add(self.accumulator, input_tensor), + ) + newacc = np.where(np.equal(ready_tensor, 1), tmp, self.accumulator) + self.accumulator = newacc + out_tensor = pb_utils.Tensor( + "OUTPUT", self.accumulator.astype(output_dtype) + ) + else: + tmp = np.where( + np.equal(ready_tensor, 1), + np.add(start_tensor, input_tensor), + np.zeros(np.shape(input_tensor), dtype=output_dtype), + ) + out_tensor = pb_utils.Tensor("OUTPUT", tmp.astype(output_dtype)) + + responses.append(pb_utils.InferenceResponse([out_tensor])) + return responses diff --git a/deploy/gke-marketplace-app/server-deployer/chart/triton/templates/istio-vs.yaml b/qa/python_models/sequence_py/config.pbtxt similarity index 76% rename from deploy/gke-marketplace-app/server-deployer/chart/triton/templates/istio-vs.yaml rename to qa/python_models/sequence_py/config.pbtxt index 32e65836a7..b58796058d 100644 --- a/deploy/gke-marketplace-app/server-deployer/chart/triton/templates/istio-vs.yaml +++ b/qa/python_models/sequence_py/config.pbtxt @@ -1,4 +1,4 @@ -# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,22 +24,30 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -apiVersion: networking.istio.io/v1alpha3 -kind: VirtualService -metadata: - name: triton-vs -spec: - hosts: - - "*" - gateways: - - triton-gateway - http: - - route: - - destination: - host: {{ template "triton-inference-server.name" . }} - port: - {{ if eq .Values.tritonProtocol "gRPC" }} - number: 8001 - {{ else }} - number: 8000 - {{ end }} +backend: "python" +max_batch_size: 4 + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 1 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] + +sequence_batching { + oldest { + max_candidate_sequences: 4 + max_queue_delay_microseconds: 1000000 + preserve_ordering: False + } + max_sequence_idle_microseconds: 10000000 +} diff --git a/qa/python_models/sequence_py/model.py b/qa/python_models/sequence_py/model.py new file mode 100644 index 0000000000..b375af3e30 --- /dev/null +++ b/qa/python_models/sequence_py/model.py @@ -0,0 +1,93 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
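The sequence_int32 model above keeps a running accumulator per model instance and relies on the START and READY control tensors injected by the sequence batcher. A hedged client sketch is below; the model name from the config above, a local gRPC endpoint, and the tritonclient package are assumptions about how the test environment serves it.

    # Illustrative client: drive one sequence through the accumulator model.
    import numpy as np
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient("localhost:8001")
    values = [2, 3, 4]
    total = 0
    for i, value in enumerate(values):
        inp = grpcclient.InferInput("INPUT", [1], "INT32")
        inp.set_data_from_numpy(np.array([value], dtype=np.int32))
        result = client.infer(
            "python_nobatch_sequence_int32",
            inputs=[inp],
            sequence_id=1001,
            sequence_start=(i == 0),
            sequence_end=(i == len(values) - 1),
        )
        total = value if i == 0 else total + value
        # With max_batch_size 0 the model returns the running sum so far.
        assert result.as_numpy("OUTPUT")[0] == total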
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + self.model_config = json.loads(args["model_config"]) + self.sequences = {} + self.decoupled = self.model_config.get("model_transaction_policy", {}).get( + "decoupled" + ) + + def get_next_sequence_output_tensor(self, request): + sid = request.correlation_id() + flags = request.flags() + if flags == pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START: + if sid in self.sequences: + raise pb_utils.TritonModelException( + "Can't start a new sequence with existing ID" + ) + self.sequences[sid] = [1] + else: + if sid not in self.sequences: + raise pb_utils.TritonModelException( + "Need START flag for a sequence ID that doesn't already exist." + ) + + last = self.sequences[sid][-1] + self.sequences[sid].append(last + 1) + + output = self.sequences[sid][-1] + output = np.array([output]) + out_tensor = pb_utils.Tensor("OUTPUT0", output.astype(np.int32)) + return out_tensor + + def execute(self, requests): + if self.decoupled: + return self.execute_decoupled(requests) + else: + return self.execute_non_decoupled(requests) + + def execute_non_decoupled(self, requests): + responses = [] + for request in requests: + output_tensor = self.get_next_sequence_output_tensor(request) + response = pb_utils.InferenceResponse([output_tensor]) + responses.append(response) + return responses + + def execute_decoupled(self, requests): + for request in requests: + sender = request.get_response_sender() + output_tensor = self.get_next_sequence_output_tensor(request) + + # Send 3 responses per request + for _ in range(3): + response = pb_utils.InferenceResponse([output_tensor]) + sender.send(response) + + sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + + return None + + def finalize(self): + print(f"Cleaning up. Final sequences stored: {self.sequences}") diff --git a/qa/python_models/string/model.py b/qa/python_models/string/model.py index 1fd5aece6e..5e419d965a 100644 --- a/qa/python_models/string/model.py +++ b/qa/python_models/string/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
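The sequence_py model above tracks a per-correlation-ID counter and, when the model transaction policy is decoupled, pushes three responses per request through the response sender before signalling TRITONSERVER_RESPONSE_COMPLETE_FINAL. A hedged sketch of the matching client side is below; it assumes the model is served as "sequence_py" with the decoupled policy enabled and uses the gRPC streaming client, which is the usual way to consume decoupled responses.

    # Illustrative decoupled client: one request in, three streamed responses out.
    import queue
    import numpy as np
    import tritonclient.grpc as grpcclient

    results = queue.Queue()
    client = grpcclient.InferenceServerClient("localhost:8001")
    client.start_stream(callback=lambda result, error: results.put((result, error)))

    inp = grpcclient.InferInput("INPUT0", [1, 1], "INT32")
    inp.set_data_from_numpy(np.array([[1]], dtype=np.int32))
    client.async_stream_infer(
        "sequence_py", inputs=[inp], sequence_id=7, sequence_start=True
    )

    for _ in range(3):  # the model sends three responses for this request
        result, error = results.get()
        assert error is None
        print(result.as_numpy("OUTPUT0"))
    client.stop_stream()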
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -35,15 +35,15 @@ class TritonPythonModel: def initialize(self, args): self._index = 0 - self._dtypes = [np.bytes_, np.object_, np.object] + self._dtypes = [np.bytes_, np.object_] def execute(self, requests): responses = [] for request in requests: in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") out_tensor_0 = pb_utils.Tensor( - "OUTPUT0", - in_0.as_numpy().astype(self._dtypes[self._index])) + "OUTPUT0", in_0.as_numpy().astype(self._dtypes[self._index]) + ) self._index += 1 responses.append(pb_utils.InferenceResponse([out_tensor_0])) return responses diff --git a/qa/python_models/string_fixed/model.py b/qa/python_models/string_fixed/model.py index d1aed94be3..d6e23eccb8 100644 --- a/qa/python_models/string_fixed/model.py +++ b/qa/python_models/string_fixed/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -35,21 +35,29 @@ class TritonPythonModel: def initialize(self, args): self._index = 0 - self._dtypes = [np.bytes_, np.object_, np.object] + self._dtypes = [np.bytes_, np.object_] def execute(self, requests): + # Create four different responses (empty string or fixed string) * (two + # datatypes) responses = [] for _ in requests: - if self._index % 2 == 0: + if self._index == 0: out_tensor_0 = pb_utils.Tensor( - "OUTPUT0", - np.array(['123456'], dtype=self._dtypes[self._index % 3])) - else: - # Test sending strings with no elements + "OUTPUT0", np.array(["123456"], dtype=self._dtypes[0]) + ) + elif self._index == 1: out_tensor_0 = pb_utils.Tensor( - "OUTPUT0", np.array([], - dtype=self._dtypes[self._index % 3])) - + "OUTPUT0", np.array([], dtype=self._dtypes[1]) + ) + elif self._index == 2: + out_tensor_0 = pb_utils.Tensor( + "OUTPUT0", np.array(["123456"], dtype=self._dtypes[0]) + ) + elif self._index == 3: + out_tensor_0 = pb_utils.Tensor( + "OUTPUT0", np.array([], dtype=self._dtypes[1]) + ) self._index += 1 responses.append(pb_utils.InferenceResponse([out_tensor_0])) return responses diff --git a/qa/python_models/string_identity/model.py b/qa/python_models/string_identity/model.py index 39575c119b..0288b129bc 100644 --- a/qa/python_models/string_identity/model.py +++ b/qa/python_models/string_identity/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,23 +24,21 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import sys import json +import sys -sys.path.append('../../') +sys.path.append("../../") import triton_python_backend_utils as pb_utils class TritonPythonModel: - """This model always returns the input that it has received. - """ + """This model always returns the input that it has received.""" def initialize(self, args): - self.model_config = json.loads(args['model_config']) + self.model_config = json.loads(args["model_config"]) def execute(self, requests): - """ This function is called on inference request. 
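The dtype lists in the string models above shrink from three entries to two because np.object was only an alias for the builtin object type; NumPy deprecated the alias in 1.20 and removed it in 1.24, so np.object_ (together with np.bytes_) is the spelling that keeps working. A small illustration:

    # np.object_ is the supported dtype for arrays of Python strings; the old
    # np.object alias raises AttributeError on NumPy >= 1.24.
    import numpy as np

    strings = np.array(["123456"], dtype=np.object_)  # works on all NumPy versions
    raw = np.array([b"123456"], dtype=np.bytes_)      # fixed-width byte strings
    print(strings.dtype, raw.dtype)                   # object |S6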
- """ + """This function is called on inference request.""" responses = [] for request in requests: diff --git a/qa/python_models/sub_add/model.py b/qa/python_models/sub_add/model.py index 0a53874629..8ac679c86f 100644 --- a/qa/python_models/sub_add/model.py +++ b/qa/python_models/sub_add/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,32 +24,31 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import sys import json +import sys -sys.path.append('../../') +import numpy as np + +sys.path.append("../../") import triton_python_backend_utils as pb_utils class TritonPythonModel: - def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) def execute(self, requests): - """ This function is called on inference request. - """ + """This function is called on inference request.""" output0_dtype = self.output0_dtype output1_dtype = self.output1_dtype @@ -59,18 +58,21 @@ def execute(self, requests): input_tensors = request.inputs() in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") - if in_0.as_numpy().dtype.type is np.bytes_ or in_0.as_numpy( - ).dtype == np.object_: - out_0, out_1 = (in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32),\ - in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32)) + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32), + ) else: - out_0, out_1 = (in_0.as_numpy() - in_1.as_numpy(), - in_0.as_numpy() + in_1.as_numpy()) + out_0, out_1 = ( + in_0.as_numpy() - in_1.as_numpy(), + in_0.as_numpy() + in_1.as_numpy(), + ) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) - responses.append( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) + responses.append(pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) return responses diff --git a/qa/python_models/torchvision/resnet50/config.pbtxt b/qa/python_models/torchvision/resnet50/config.pbtxt new file mode 100644 index 0000000000..fdbc7c7de9 --- /dev/null +++ b/qa/python_models/torchvision/resnet50/config.pbtxt @@ -0,0 +1,40 @@ +# Copyright (c) 2023, 
NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "resnet50_python" +backend: "python" +max_batch_size: 128 +input { + name: "INPUT0" + data_type: TYPE_FP32 + format: FORMAT_NCHW + dims: [ 3, 224, 224 ] + } +output { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1000 ] + } diff --git a/qa/python_models/torchvision/resnet50/model.py b/qa/python_models/torchvision/resnet50/model.py new file mode 100644 index 0000000000..1e2dbbf7a1 --- /dev/null +++ b/qa/python_models/torchvision/resnet50/model.py @@ -0,0 +1,62 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
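The resnet50_python config above declares a single FORMAT_NCHW FP32 input of shape [3, 224, 224] with max_batch_size 128. A hedged sketch of the client-side preprocessing such an input typically expects is below; the ImageNet mean/std values are the standard ones and the image path is purely illustrative.

    # Illustrative preprocessing: resize to 224x224, scale to [0, 1], normalize
    # with ImageNet statistics, then reorder HWC -> NCHW with a batch dimension.
    import numpy as np
    from PIL import Image

    img = Image.open("cat.jpg").convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    x = (x - mean) / std
    x = np.transpose(x, (2, 0, 1))[np.newaxis, ...]
    print(x.shape, x.dtype)  # (1, 3, 224, 224) float32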
+ +import torch +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import to_dlpack + + +class TritonPythonModel: + def initialize(self, args): + """ + This function initializes pre-trained ResNet50 model. + """ + self.device = "cuda" if args["model_instance_kind"] == "GPU" else "cpu" + # Our tests currently depend on torchvision=0.14, + # to make sure `torch.hub` loads Resnet50 implementation + # compatible with torchvision=0.14, we need to provide tag + self.model = ( + torch.hub.load( + "pytorch/vision:v0.14.1", "resnet50", weights="IMAGENET1K_V2" + ) + .to(self.device) + .eval() + ) + + def execute(self, requests): + """ + This function receives a list of requests (`pb_utils.InferenceRequest`), + performs inference on every request and appends it to responses. + """ + responses = [] + for request in requests: + input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") + result = self.model( + torch.as_tensor(input_tensor.as_numpy(), device=self.device) + ) + out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(result)) + responses.append(pb_utils.InferenceResponse([out_tensor])) + return responses diff --git a/qa/python_models/variable_gpu_output/config.pbtxt b/qa/python_models/variable_gpu_output/config.pbtxt new file mode 100644 index 0000000000..8fe69444f7 --- /dev/null +++ b/qa/python_models/variable_gpu_output/config.pbtxt @@ -0,0 +1,55 @@ +# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
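The resnet50 model above hands its GPU result to Triton without a host copy by pairing torch.utils.dlpack.to_dlpack with pb_utils.Tensor.from_dlpack. A short sketch isolating that hand-off is below; it only runs inside a Python backend model on a GPU instance, and the tensor name and shape are illustrative.

    # DLPack hand-off: wrap a CUDA torch tensor as a Triton output tensor
    # without copying it to host memory.
    import torch
    import triton_python_backend_utils as pb_utils
    from torch.utils.dlpack import to_dlpack

    result = torch.ones(1, 1000, device="cuda")  # stand-in for the model output
    out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(result))

The reverse direction, viewing a GPU input tensor as a torch tensor, follows the same idea through the tensor's DLPack interface.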
+ +name: "variable_gpu_output" +backend: "python" +max_batch_size: 256 + +input [ + { + name: "INPUT" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] +output [ + { + name: "OUTPUT" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +dynamic_batching { + max_queue_delay_microseconds: 1000000 +} + +instance_group [ + { + count: 1 + kind: KIND_GPU + } +] diff --git a/qa/python_models/variable_gpu_output/model.py b/qa/python_models/variable_gpu_output/model.py new file mode 100644 index 0000000000..2da2a3cbd2 --- /dev/null +++ b/qa/python_models/variable_gpu_output/model.py @@ -0,0 +1,46 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import torch +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import to_dlpack + + +class TritonPythonModel: + def execute(self, requests): + # The client will send 5 requests + assert len(requests) == 5 + responses = [] + for i, request in enumerate(requests): + # Create an (i+1)-element array with all the tensors equal to (i+1) + output = torch.ones(i + 1, dtype=torch.float32, device="cuda") + output = output * (i + 1) + output_pb_tensor = pb_utils.Tensor.from_dlpack("OUTPUT", to_dlpack(output)) + inference_response = pb_utils.InferenceResponse( + output_tensors=[output_pb_tensor] + ) + responses.append(inference_response) + return responses diff --git a/qa/python_models/wrong_model/model.py b/qa/python_models/wrong_model/model.py index 9059255395..2cac72324f 100644 --- a/qa/python_models/wrong_model/model.py +++ b/qa/python_models/wrong_model/model.py @@ -1,4 +1,4 @@ -# Copyright 2020-2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
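The variable_gpu_output model above leans on dynamic batching (a one second max_queue_delay) to see all five client requests in a single execute() call, then answers each with a different-length OUTPUT, which is why the output dims are [-1]. A hedged client sketch is below; the server address is an assumption and the request count mirrors the assert in the model.

    # Illustrative client: launch five concurrent requests so dynamic batching
    # can group them into one execute() call on the model above.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient("localhost:8000")
    handles = []
    for i in range(5):
        inp = httpclient.InferInput("INPUT", [1, 1], "FP32")
        inp.set_data_from_numpy(np.array([[float(i)]], dtype=np.float32))
        handles.append(client.async_infer("variable_gpu_output", inputs=[inp]))

    for i, handle in enumerate(handles):
        out = handle.get_result().as_numpy("OUTPUT")
        print(f"request {i}: {out}")  # the batch position determines length and value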
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,7 +28,6 @@ class TritonPythonModel: - def execute(self, requests): """ This model ensures that errors in the execute function are properly diff --git a/qa/python_models/wrong_return_type/config.pbtxt b/qa/python_models/wrong_return_type/config.pbtxt new file mode 100644 index 0000000000..e34905e635 --- /dev/null +++ b/qa/python_models/wrong_return_type/config.pbtxt @@ -0,0 +1,49 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "wrong_return_type" +backend: "python" + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] + +sequence_batching { + iterative_sequence : true +} + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/wrong_return_type/model.py b/qa/python_models/wrong_return_type/model.py new file mode 100644 index 0000000000..c5e6f660fc --- /dev/null +++ b/qa/python_models/wrong_return_type/model.py @@ -0,0 +1,67 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + + self.output0_dtype = pb_utils.triton_string_to_numpy( + output0_config["data_type"] + ) + + def execute(self, requests): + output0_dtype = self.output0_dtype + + responses = [] + + for request in requests: + in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") + + out_0 = in_0.as_numpy() + + # Create output tensors. You need pb_utils.Tensor + # objects to create pb_utils.InferenceResponse. + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + + inference_response = pb_utils.InferenceResponse( + output_tensors=[out_tensor_0] + ) + + request.set_release_flags(pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE) + # Should append `None` for rescheduled requests. + responses.append(inference_response) + + return responses + + def finalize(self): + pass diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt index d17392b869..f64894c5cd 100644 --- a/src/CMakeLists.txt +++ b/src/CMakeLists.txt @@ -1,4 +1,4 @@ -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
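The wrong_return_type model above deliberately pairs a RESCHEDULE release flag with a real InferenceResponse instead of the required None placeholder, so the test expects the inference to fail rather than complete. A hedged client-side sketch of that expectation is below; the endpoint is an assumption and no particular error text is assumed.

    # Illustrative expectation: inference against the wrong_return_type model
    # should surface an error because of the flag/response mismatch.
    import numpy as np
    import tritonclient.grpc as grpcclient
    from tritonclient.utils import InferenceServerException

    client = grpcclient.InferenceServerClient("localhost:8001")
    inp = grpcclient.InferInput("INPUT0", [4], "FP32")
    inp.set_data_from_numpy(np.ones(4, dtype=np.float32))
    try:
        client.infer("wrong_return_type", inputs=[inp])
    except InferenceServerException as e:
        print("expected failure:", e)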
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -68,13 +68,6 @@ if(${TRITON_ENABLE_GPU}) message(STATUS "Using CUDA ${CUDA_VERSION}") endif() # TRITON_ENABLE_GPU -# GRPC -# -if(${TRITON_ENABLE_GRPC}) - find_package(gRPC CONFIG REQUIRED) - message(STATUS "Using gRPC ${gRPC_VERSION}") -endif() - # libevent # if(${TRITON_ENABLE_HTTP} OR ${TRITON_ENABLE_METRICS} OR @@ -83,6 +76,16 @@ if(${TRITON_ENABLE_HTTP} OR ${TRITON_ENABLE_METRICS} OR message(STATUS "Using libevent ${Libevent_VERSION}") endif() +# OpenTelemetry +# +if (NOT WIN32 AND ${TRITON_ENABLE_TRACING}) + find_package(absl CONFIG REQUIRED) + find_package(CURL CONFIG REQUIRED) + find_package(nlohmann_json CONFIG REQUIRED) + find_package(opentelemetry-cpp CONFIG REQUIRED) + message(STATUS "Using opentelemetry-cpp ${opentelemetry-cpp_VERSION}") +endif() + # re2 # find_library(RE2_LIBRARY NAMES re2) @@ -93,6 +96,7 @@ find_library(RE2_LIBRARY NAMES re2) add_executable( main classification.cc + command_line_parser.cc common.cc main.cc shared_memory_manager.cc @@ -121,8 +125,11 @@ if(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC") target_compile_options( main PRIVATE - /W1 /D_WIN32_WINNT=0x0A00 /EHsc + /W1 /D_WIN32_WINNT=0x0A00 /EHsc /Zc:preprocessor ) + target_compile_definitions(main + PRIVATE + NOMINMAX) else() target_compile_options( main @@ -185,18 +192,6 @@ if(${TRITON_ENABLE_HTTP} OR ${TRITON_ENABLE_METRICS} OR ) endif() -if(${TRITON_ENABLE_GRPC}) - target_include_directories( - main - PRIVATE - $ - ) - - target_compile_definitions( - main - PRIVATE TRITON_ENABLE_GRPC=1 - ) -endif() # TRITON_ENABLE_GRPC if(${TRITON_ENABLE_HTTP}) target_compile_definitions( @@ -245,6 +240,14 @@ if(${TRITON_ENABLE_TRACING}) main PRIVATE TRITON_ENABLE_TRACING=1 ) +# FIXME: remove, when Windows support is added for Opentelemetry + if (NOT WIN32) + target_include_directories( + main + PRIVATE + ${OPENTELEMETRY_CPP_INCLUDE_DIRS} + ) + endif() endif() # TRITON_ENABLE_TRACING if(${TRITON_ENABLE_NVTX}) @@ -278,116 +281,31 @@ else() ) endif() -# grpc endpoint -# if(${TRITON_ENABLE_GRPC}) - add_library( - grpc-endpoint-library EXCLUDE_FROM_ALL - grpc_server.cc - grpc_server.h - ) - - target_compile_features(grpc-endpoint-library PRIVATE cxx_std_11) - if(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC") - target_compile_options( - grpc-endpoint-library - PRIVATE - /W1 /D_WIN32_WINNT=0x0A00 /EHsc - ) - else() - target_compile_options( - grpc-endpoint-library - PRIVATE - -Wall -Wextra -Wno-unused-parameter -Wno-deprecated-declarations -Werror - ) - endif() - - set_target_properties( - grpc-endpoint-library - PROPERTIES - POSITION_INDEPENDENT_CODE ON - ) + # + # GRPC + # + find_package(gRPC CONFIG REQUIRED) + message(STATUS "Using gRPC ${gRPC_VERSION}") + add_subdirectory(grpc) target_link_libraries( - grpc-endpoint-library - PUBLIC - proto-library # from repo-common - triton-common-logging # from repo-common - triton-common-json # from repo-common - grpc-service-library # from repo-common - triton-core-serverapi # from repo-core - triton-core-serverstub # from repo-core - gRPC::grpc++ - gRPC::grpc - protobuf::libprotobuf + main + PRIVATE + grpc-endpoint-library ) target_include_directories( - grpc-endpoint-library + main PRIVATE $ ) target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_GRPC=1 - ) - - if(${TRITON_ENABLE_GPU}) - target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_GPU=1 - PRIVATE 
TRITON_MIN_COMPUTE_CAPABILITY=${TRITON_MIN_COMPUTE_CAPABILITY} - ) - - target_link_libraries( - grpc-endpoint-library - PUBLIC - CUDA::cudart - ) - endif() # TRITON_ENABLE_GPU - - if(${TRITON_ENABLE_METRICS}) - target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_METRICS=1 - ) - endif() # TRITON_ENABLE_METRICS - - if(${TRITON_ENABLE_LOGGING}) - target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_LOGGING=1 - ) - endif() # TRITON_ENABLE_LOGGING - - if(${TRITON_ENABLE_STATS}) - target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_STATS=1 - ) - endif() # TRITON_ENABLE_STATS - - if(${TRITON_ENABLE_TRACING}) - target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_TRACING=1 - ) - endif() # TRITON_ENABLE_TRACING - - if(${TRITON_ENABLE_NVTX}) - target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_NVTX=1 - ) - endif() # TRITON_ENABLE_NVTX - -target_link_libraries( main - PRIVATE - grpc-endpoint-library + PRIVATE TRITON_ENABLE_GRPC=1 ) -endif() # TRITON_ENABLE_GRPC +endif() # http endpoint # @@ -440,7 +358,7 @@ if(${TRITON_ENABLE_HTTP} target_compile_options( http-endpoint-library PRIVATE - /W1 /D_WIN32_WINNT=0x0A00 /EHsc + /W1 /D_WIN32_WINNT=0x0A00 /EHsc /Zc:preprocessor ) else() target_compile_options( @@ -473,6 +391,16 @@ if(${TRITON_ENABLE_HTTP} PRIVATE $ ) + # FIXME when Triton support of Opentelemetry is available on Windows + # add ${OPENTELEMETRY_CPP_INCLUDE_DIRS} to above target_include_directories + # JIRA DLIS-4786 + if (NOT WIN32 AND ${TRITON_ENABLE_TRACING}) + target_include_directories( + http-endpoint-library + PRIVATE ${OPENTELEMETRY_CPP_INCLUDE_DIRS} + ) + endif() + if(${TRITON_ENABLE_GPU}) target_compile_definitions( http-endpoint-library @@ -579,6 +507,20 @@ if(${TRITON_ENABLE_TRACING}) tracer.cc tracer.h ) + if (NOT WIN32) + target_compile_features(tracing-library PRIVATE cxx_std_17) + + target_include_directories( + tracing-library + PRIVATE ${OPENTELEMETRY_CPP_INCLUDE_DIRS} + ) + + target_link_libraries( + tracing-library + PRIVATE + ${OPENTELEMETRY_CPP_LIBRARIES}) + endif() + target_link_libraries( tracing-library PUBLIC @@ -658,7 +600,7 @@ if (NOT WIN32) target_compile_options( simple PRIVATE - /W1 /D_WIN32_WINNT=0x0A00 /EHsc + /W1 /D_WIN32_WINNT=0x0A00 /EHsc /Zc:preprocessor ) else() target_compile_options( @@ -722,7 +664,7 @@ if (NOT WIN32) target_compile_options( multi_server PRIVATE - /W1 /D_WIN32_WINNT=0x0A00 /EHsc + /W1 /D_WIN32_WINNT=0x0A00 /EHsc /Zc:preprocessor ) else() target_compile_options( @@ -787,7 +729,7 @@ if (NOT WIN32) target_compile_options( memory_alloc PRIVATE - /W1 /D_WIN32_WINNT=0x0A00 /EHsc + /W1 /D_WIN32_WINNT=0x0A00 /EHsc /Zc:preprocessor ) else() target_compile_options( @@ -831,6 +773,6 @@ if (NOT WIN32) endif() # NOT WIN32 # Currently unit tests do not build for windows... 
-if (NOT WIN32) +if ( NOT WIN32) add_subdirectory(test test) endif() # NOT WIN32 diff --git a/src/classification.cc b/src/classification.cc index d8dab03817..2d8cd26b9e 100644 --- a/src/classification.cc +++ b/src/classification.cc @@ -28,6 +28,7 @@ #include #include + #include "common.h" namespace triton { namespace server { diff --git a/src/classification.h b/src/classification.h index 27c8ba1ef6..9264baa2b0 100644 --- a/src/classification.h +++ b/src/classification.h @@ -27,6 +27,7 @@ #include #include + #include "triton/core/tritonserver.h" namespace triton { namespace server { diff --git a/src/command_line_parser.cc b/src/command_line_parser.cc new file mode 100644 index 0000000000..20307eae9f --- /dev/null +++ b/src/command_line_parser.cc @@ -0,0 +1,2244 @@ +// Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NVIDIA CORPORATION nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +// + +#include "command_line_parser.h" +constexpr const char* GLOBAL_OPTION_GROUP = ""; + +#ifdef _WIN32 +int optind = 1; +const char* optarg = nullptr; + +/// Implementation of `getopt_long` for Windows. +/// Linux uses available implementation: +/// https://github.com/gcc-mirror/gcc/blob/fab08d12b40ad637c5a4ce8e026fb43cd3f0fad1/include/getopt.h +/// and +/// https://github.com/gcc-mirror/gcc/blob/fab08d12b40ad637c5a4ce8e026fb43cd3f0fad1/libiberty/getopt.c#L521 +/// Parameters' description is available here: +/// https://github.com/gcc-mirror/gcc/blob/fab08d12b40ad637c5a4ce8e026fb43cd3f0fad1/libiberty/getopt.c#L464-L518 +/// `optind' is an index to iterate over `argv`, (whose length is `argc`), +/// and starts from 1, since argv[0] is the program name. +/// Text in the current `argv`-element is returned in `optarg'. +/// Note: if option was provided in the form of --=, then +/// optarg is (argv[optind] + found + 1), i.e. everything after `=`. +/// Alternatively, option can be provided as -- . +/// In this case, is storred as a separate parameter in `argv`. +/// `longind` returns the index in `longopts` of the long-named option found. 
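The Windows getopt_long shim documented above accepts each long option either as --name=value, where the value is everything after the first '=', or as --name value, where the value is taken from the next argv entry, and reports an error when a required value is missing. A small Python analogue of that contract is below, purely as an illustration of the parsing rules rather than the C++ implementation itself.

    # Python analogue of the long-option contract above:
    # "--key=value" or "--key value"; a missing required value is an error.
    def parse_long_options(argv, required):
        opts, i = {}, 0
        while i < len(argv):
            arg = argv[i]
            assert arg.startswith("--"), f"unexpected argument: {arg}"
            key, eq, value = arg[2:].partition("=")
            if not eq and key in required:
                i += 1
                if i >= len(argv):
                    raise ValueError(f"option '--{key}' requires an argument")
                value = argv[i]
            opts[key] = value
            i += 1
        return opts

    print(parse_long_options(
        ["--model-repository=/models", "--log-verbose", "1"],
        required={"model-repository", "log-verbose"},
    ))
    # {'model-repository': '/models', 'log-verbose': '1'}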
+ +int +getopt_long( + int argc, char* const argv[], const char* optstring, + const struct option* longopts, int* longind) +{ + if (optind >= argc) { + return -1; + } + const struct option* curr_longopt = longopts; + std::string argv_str = argv[optind]; + size_t found = argv_str.find_first_of("="); + std::string key = argv_str.substr( + 2, (found == std::string::npos) ? std::string::npos : (found - 2)); + int option_index = 0; + for (curr_longopt, option_index; curr_longopt->name; + curr_longopt++, option_index++) { + if (key == curr_longopt->name) { + if (longind != NULL) + (*longind) = option_index; + if (curr_longopt->has_arg == required_argument) { + if (found == std::string::npos) { + optind++; + if (optind >= argc) { + std::cerr << argv[0] << ": option '" << argv_str + << "' requires an argument" << std::endl; + return '?'; + } + optarg = argv[optind]; + } else { + optarg = (argv[optind] + found + 1); + } + } + optind++; + return curr_longopt->val; + } + } + return -1; +} +#endif + +#include +#include +#include +#include + +#include "common.h" + +#define TRITONJSON_STATUSTYPE TRITONSERVER_Error* +#define TRITONJSON_STATUSRETURN(M) \ + return TRITONSERVER_ErrorNew(TRITONSERVER_ERROR_INTERNAL, (M).c_str()) +#define TRITONJSON_STATUSSUCCESS nullptr +#include "triton/common/triton_json.h" + + +namespace triton { namespace server { + +// [FIXME] expose following parse helpers for other type of parser +namespace { + +// A wrapper around std::stoi, std::stoull, std::stoll, std::stod +// to catch `invalid argument` and `out of range` exceptions +template +T StringTo(const std::string& arg); + +template <> +int +StringTo(const std::string& arg) +{ + return std::stoi(arg); +} + +template <> +uint64_t +StringTo(const std::string& arg) +{ + return std::stoull(arg); +} + +template <> +int64_t +StringTo(const std::string& arg) +{ + return std::stoll(arg); +} + +template <> +double +StringTo(const std::string& arg) +{ + return std::stod(arg); +} + +// There must be specialization for the types to be parsed into so that +// the argument is properly validated and parsed. Attempted to use input +// operator (>>) but it will consume improper argument without error +// (i.e. parse "1.4" to 'int' will return 1 but we want to report error). +template +T +ParseOption(const std::string& arg) +{ + try { + return StringTo(arg); + } + catch (const std::invalid_argument& ia) { + std::stringstream ss; + ss << "Invalid option value. Got " << arg << std::endl; + throw ParseException(ss.str()); + } + catch (const std::out_of_range& oor) { + std::stringstream ss; + ss << "Provided option value is out of bound. Got " << arg << std::endl; + throw ParseException(ss.str()); + } +} + +template <> +bool +ParseOption(const std::string& arg) +{ + // 'arg' need to comply with template declaration + std::string larg = arg; + std::transform(larg.begin(), larg.end(), larg.begin(), [](unsigned char c) { + return std::tolower(c); + }); + + if ((larg == "true") || (larg == "on") || (larg == "1")) { + return true; + } + if ((larg == "false") || (larg == "off") || (larg == "0")) { + return false; + } + + throw ParseException("invalid value for bool option: " + arg); +} + +// Condition here merely to avoid compilation error, this function will +// be defined but not used otherwise. 
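A self-contained sketch, not part of the patch, of how the ParseOption helpers above surface bad values. ParseIntOrThrow is a hypothetical stand-in (the real helpers live in an anonymous namespace and throw ParseException), but the two failure paths mirror the invalid-argument and out-of-range messages built above; the bool specialization likewise accepts only true/on/1 and false/off/0, case-insensitively, and rejects anything else.

    // Hypothetical stand-in for the integer ParseOption specialization.
    #include <iostream>
    #include <stdexcept>
    #include <string>

    int
    ParseIntOrThrow(const std::string& arg)
    {
      try {
        return std::stoi(arg);
      }
      catch (const std::invalid_argument&) {
        throw std::runtime_error("Invalid option value. Got " + arg);
      }
      catch (const std::out_of_range&) {
        throw std::runtime_error("Provided option value is out of bound. Got " + arg);
      }
    }

    int
    main()
    {
      std::cout << ParseIntOrThrow("8001") << std::endl;  // prints 8001
      for (const std::string bad : {"not-a-number", "99999999999999999999"}) {
        try {
          ParseIntOrThrow(bad);  // invalid argument, then out of range
        }
        catch (const std::runtime_error& e) {
          std::cerr << e.what() << std::endl;
        }
      }
      return 0;
    }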
+#ifdef TRITON_ENABLE_LOGGING +int +ParseIntBoolOption(std::string arg) +{ + std::transform(arg.begin(), arg.end(), arg.begin(), [](unsigned char c) { + return std::tolower(c); + }); + + if (arg == "true") { + return 1; + } + if (arg == "false") { + return 0; + } + + return ParseOption(arg); +} +#endif // TRITON_ENABLE_LOGGING + +std::string +PairsToJsonStr(std::vector> settings) +{ + triton::common::TritonJson::Value json( + triton::common::TritonJson::ValueType::OBJECT); + for (const auto& setting : settings) { + const auto& key = setting.first; + const auto& value = setting.second; + json.SetStringObject(key.c_str(), value); + } + triton::common::TritonJson::WriteBuffer buffer; + auto err = json.Write(&buffer); + if (err != nullptr) { + LOG_TRITONSERVER_ERROR(err, "failed to convert config to JSON"); + } + return buffer.Contents(); +} + +template +std::pair +ParsePairOption(const std::string& arg, const std::string& delim_str) +{ + int delim = arg.find(delim_str); + + if ((delim < 0)) { + std::stringstream ss; + ss << "Cannot parse pair option due to incorrect number of inputs." + "-- argument requires format " + << delim_str << ". " + << "Found: " << arg << std::endl; + throw ParseException(ss.str()); + } + + std::string first_string = arg.substr(0, delim); + std::string second_string = arg.substr(delim + delim_str.length()); + + // Specific conversion from key-value string to actual key-value type, + // should be extracted out of this function if we need to parse + // more pair option of different types. + return {ParseOption(first_string), ParseOption(second_string)}; +} + +// Split 'options' by 'delim_str' and place split strings into a vector +std::vector +SplitOptions(std::string options, const std::string& delim_str) +{ + std::vector res; + + int delim = options.find(delim_str); + while ((delim >= 0)) { + res.emplace_back(options.substr(0, delim)); + options = options.substr(delim + delim_str.length()); + delim = options.find(delim_str); + } + // include last element + res.emplace_back(options); + return res; +} + +} // namespace + +enum TritonOptionId { + OPTION_HELP = 1000, +#ifdef TRITON_ENABLE_LOGGING + OPTION_LOG_VERBOSE, + OPTION_LOG_INFO, + OPTION_LOG_WARNING, + OPTION_LOG_ERROR, + OPTION_LOG_FORMAT, + OPTION_LOG_FILE, +#endif // TRITON_ENABLE_LOGGING + OPTION_ID, + OPTION_MODEL_REPOSITORY, + OPTION_EXIT_ON_ERROR, + OPTION_DISABLE_AUTO_COMPLETE_CONFIG, + OPTION_STRICT_MODEL_CONFIG, + OPTION_STRICT_READINESS, +#if defined(TRITON_ENABLE_HTTP) + OPTION_ALLOW_HTTP, + OPTION_HTTP_HEADER_FORWARD_PATTERN, + OPTION_HTTP_PORT, + OPTION_REUSE_HTTP_PORT, + OPTION_HTTP_ADDRESS, + OPTION_HTTP_THREAD_COUNT, + OPTION_HTTP_RESTRICTED_API, +#endif // TRITON_ENABLE_HTTP +#if defined(TRITON_ENABLE_GRPC) + OPTION_ALLOW_GRPC, + OPTION_GRPC_PORT, + OPTION_REUSE_GRPC_PORT, + OPTION_GRPC_ADDRESS, + OPTION_GRPC_HEADER_FORWARD_PATTERN, + OPTION_GRPC_INFER_ALLOCATION_POOL_SIZE, + OPTION_GRPC_USE_SSL, + OPTION_GRPC_USE_SSL_MUTUAL, + OPTION_GRPC_SERVER_CERT, + OPTION_GRPC_SERVER_KEY, + OPTION_GRPC_ROOT_CERT, + OPTION_GRPC_RESPONSE_COMPRESSION_LEVEL, + OPTION_GRPC_ARG_KEEPALIVE_TIME_MS, + OPTION_GRPC_ARG_KEEPALIVE_TIMEOUT_MS, + OPTION_GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, + OPTION_GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA, + OPTION_GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS, + OPTION_GRPC_ARG_HTTP2_MAX_PING_STRIKES, + OPTION_GRPC_RESTRICTED_PROTOCOL, + OPTION_GRPC_ARG_MAX_CONNECTION_AGE_MS, + OPTION_GRPC_ARG_MAX_CONNECTION_AGE_GRACE_MS, +#endif // TRITON_ENABLE_GRPC +#if 
defined(TRITON_ENABLE_SAGEMAKER) + OPTION_ALLOW_SAGEMAKER, + OPTION_SAGEMAKER_PORT, + OPTION_SAGEMAKER_SAFE_PORT_RANGE, + OPTION_SAGEMAKER_THREAD_COUNT, +#endif // TRITON_ENABLE_SAGEMAKER +#if defined(TRITON_ENABLE_VERTEX_AI) + OPTION_ALLOW_VERTEX_AI, + OPTION_VERTEX_AI_PORT, + OPTION_VERTEX_AI_THREAD_COUNT, + OPTION_VERTEX_AI_DEFAULT_MODEL, +#endif // TRITON_ENABLE_VERTEX_AI +#ifdef TRITON_ENABLE_METRICS + OPTION_ALLOW_METRICS, + OPTION_ALLOW_GPU_METRICS, + OPTION_ALLOW_CPU_METRICS, + OPTION_METRICS_ADDRESS, + OPTION_METRICS_PORT, + OPTION_METRICS_INTERVAL_MS, + OPTION_METRICS_CONFIG, +#endif // TRITON_ENABLE_METRICS +#ifdef TRITON_ENABLE_TRACING + OPTION_TRACE_FILEPATH, + OPTION_TRACE_LEVEL, + OPTION_TRACE_RATE, + OPTION_TRACE_COUNT, + OPTION_TRACE_LOG_FREQUENCY, + OPTION_TRACE_CONFIG, +#endif // TRITON_ENABLE_TRACING + OPTION_MODEL_CONTROL_MODE, + OPTION_POLL_REPO_SECS, + OPTION_STARTUP_MODEL, + OPTION_RATE_LIMIT, + OPTION_RATE_LIMIT_RESOURCE, + OPTION_PINNED_MEMORY_POOL_BYTE_SIZE, + OPTION_CUDA_MEMORY_POOL_BYTE_SIZE, + OPTION_CUDA_VIRTUAL_ADDRESS_SIZE, + OPTION_RESPONSE_CACHE_BYTE_SIZE, + OPTION_CACHE_CONFIG, + OPTION_CACHE_DIR, + OPTION_MIN_SUPPORTED_COMPUTE_CAPABILITY, + OPTION_EXIT_TIMEOUT_SECS, + OPTION_BACKEND_DIR, + OPTION_REPOAGENT_DIR, + OPTION_BUFFER_MANAGER_THREAD_COUNT, + OPTION_MODEL_LOAD_THREAD_COUNT, + OPTION_BACKEND_CONFIG, + OPTION_HOST_POLICY, + OPTION_MODEL_LOAD_GPU_LIMIT, + OPTION_MODEL_NAMESPACING +}; + +void +TritonParser::SetupOptions() +{ + global_options_.push_back( + {OPTION_HELP, "help", Option::ArgNone, "Print usage"}); + + server_options_.push_back( + {OPTION_ID, "id", Option::ArgStr, "Identifier for this server."}); + server_options_.push_back( + {OPTION_EXIT_TIMEOUT_SECS, "exit-timeout-secs", Option::ArgInt, + "Timeout (in seconds) when exiting to wait for in-flight inferences to " + "finish. After the timeout expires the server exits even if inferences " + "are still in flight."}); + + model_repo_options_.push_back( + {OPTION_MODEL_REPOSITORY, "model-store", Option::ArgStr, + "Equivalent to --model-repository."}); + model_repo_options_.push_back( + {OPTION_MODEL_REPOSITORY, "model-repository", Option::ArgStr, + "Path to model repository directory. It may be specified multiple times " + "to add multiple model repositories. Note that if a model is not unique " + "across all model repositories at any time, the model will not be " + "available."}); + model_repo_options_.push_back( + {OPTION_EXIT_ON_ERROR, "exit-on-error", Option::ArgBool, + "Exit the inference server if an error occurs during initialization."}); + model_repo_options_.push_back( + {OPTION_DISABLE_AUTO_COMPLETE_CONFIG, "disable-auto-complete-config", + Option::ArgNone, + "If set, disables the triton and backends from auto completing model " + "configuration files. Model configuration files must be provided and " + "all required " + "configuration settings must be specified."}); + model_repo_options_.push_back( + {OPTION_STRICT_READINESS, "strict-readiness", Option::ArgBool, + "If true /v2/health/ready endpoint indicates ready if the server " + "is responsive and all models are available. If false " + "/v2/health/ready endpoint indicates ready if server is responsive " + "even if some/all models are unavailable."}); + model_repo_options_.push_back( + {OPTION_MODEL_CONTROL_MODE, "model-control-mode", Option::ArgStr, + "Specify the mode for model management. Options are \"none\", \"poll\" " + "and \"explicit\". The default is \"none\". 
" + "For \"none\", the server will load all models in the model " + "repository(s) at startup and will not make any changes to the load " + "models after that. For \"poll\", the server will poll the model " + "repository(s) to detect changes and will load/unload models based on " + "those changes. The poll rate is controlled by 'repository-poll-secs'. " + "For \"explicit\", model load and unload is initiated by using the " + "model control APIs, and only models specified with --load-model will " + "be loaded at startup."}); + model_repo_options_.push_back( + {OPTION_POLL_REPO_SECS, "repository-poll-secs", Option::ArgInt, + "Interval in seconds between each poll of the model repository to check " + "for changes. Valid only when --model-control-mode=poll is " + "specified."}); + model_repo_options_.push_back( + {OPTION_STARTUP_MODEL, "load-model", Option::ArgStr, + "Name of the model to be loaded on server startup. It may be specified " + "multiple times to add multiple models. To load ALL models at startup, " + "specify '*' as the model name with --load-model=* as the ONLY " + "--load-model argument, this does not imply any pattern matching. " + "Specifying --load-model=* in conjunction with another --load-model " + "argument will result in error. Note that this option will only take " + "effect if --model-control-mode=explicit is true."}); + model_repo_options_.push_back( + {OPTION_MODEL_LOAD_THREAD_COUNT, "model-load-thread-count", + Option::ArgInt, + "The number of threads used to concurrently load models in " + "model repositories. Default is 4."}); + model_repo_options_.push_back( + {OPTION_MODEL_NAMESPACING, "model-namespacing", Option::ArgBool, + "Whether model namespacing is enable or not. If true, models with the " + "same name can be served if they are in different namespace."}); + +#if defined(TRITON_ENABLE_HTTP) + http_options_.push_back( + {OPTION_ALLOW_HTTP, "allow-http", Option::ArgBool, + "Allow the server to listen for HTTP requests."}); + http_options_.push_back( + {OPTION_HTTP_ADDRESS, "http-address", Option::ArgStr, + "The address for the http server to bind to. Default is 0.0.0.0"}); + http_options_.push_back( + {OPTION_HTTP_PORT, "http-port", Option::ArgInt, + "The port for the server to listen on for HTTP " + "requests. Default is 8000."}); + http_options_.push_back( + {OPTION_REUSE_HTTP_PORT, "reuse-http-port", Option::ArgBool, + "Allow multiple servers to listen on the same HTTP port when every " + "server has this option set. If you plan to use this option as a way to " + "load balance between different Triton servers, the same model " + "repository or set of models must be used for every server."}); + http_options_.push_back( + {OPTION_HTTP_HEADER_FORWARD_PATTERN, "http-header-forward-pattern", + Option::ArgStr, + "The regular expression pattern that will be used for forwarding HTTP " + "headers as inference request parameters."}); + http_options_.push_back( + {OPTION_HTTP_THREAD_COUNT, "http-thread-count", Option::ArgInt, + "Number of threads handling HTTP requests."}); + http_options_.push_back( + {OPTION_HTTP_RESTRICTED_API, "http-restricted-api", + ":=", + "Specify restricted HTTP api setting. The format of this " + "flag is --http-restricted-api=,=. Where " + " is a comma-separated list of apis to be restricted. " + " will be additional header key to be checked when a HTTP request " + "is received, and is the value expected to be matched." 
+ " Allowed APIs: " + + Join(RESTRICTED_CATEGORY_NAMES, ", ")}); +#endif // TRITON_ENABLE_HTTP + +#if defined(TRITON_ENABLE_GRPC) + grpc_options_.push_back( + {OPTION_ALLOW_GRPC, "allow-grpc", Option::ArgBool, + "Allow the server to listen for GRPC requests."}); + grpc_options_.push_back( + {OPTION_GRPC_ADDRESS, "grpc-address", Option::ArgStr, + "The address for the grpc server to binds to. Default is 0.0.0.0"}); + grpc_options_.push_back( + {OPTION_GRPC_PORT, "grpc-port", Option::ArgInt, + "The port for the server to listen on for GRPC " + "requests. Default is 8001."}); + grpc_options_.push_back( + {OPTION_REUSE_GRPC_PORT, "reuse-grpc-port", Option::ArgBool, + "Allow multiple servers to listen on the same GRPC port when every " + "server has this option set. If you plan to use this option as a way to " + "load balance between different Triton servers, the same model " + "repository or set of models must be used for every server."}); + grpc_options_.push_back( + {OPTION_GRPC_HEADER_FORWARD_PATTERN, "grpc-header-forward-pattern", + Option::ArgStr, + "The regular expression pattern that will be used for forwarding GRPC " + "headers as inference request parameters."}); + grpc_options_.push_back( + {OPTION_GRPC_INFER_ALLOCATION_POOL_SIZE, + "grpc-infer-allocation-pool-size", Option::ArgInt, + "The maximum number of inference request/response objects that remain " + "allocated for reuse. As long as the number of in-flight requests " + "doesn't exceed this value there will be no allocation/deallocation of " + "request/response objects."}); + grpc_options_.push_back( + {OPTION_GRPC_USE_SSL, "grpc-use-ssl", Option::ArgBool, + "Use SSL authentication for GRPC requests. Default is false."}); + grpc_options_.push_back( + {OPTION_GRPC_USE_SSL_MUTUAL, "grpc-use-ssl-mutual", Option::ArgBool, + "Use mututal SSL authentication for GRPC requests. This option will " + "preempt '--grpc-use-ssl' if it is also specified. Default is false."}); + grpc_options_.push_back( + {OPTION_GRPC_SERVER_CERT, "grpc-server-cert", Option::ArgStr, + "File holding PEM-encoded server certificate. Ignored unless " + "--grpc-use-ssl is true."}); + grpc_options_.push_back( + {OPTION_GRPC_SERVER_KEY, "grpc-server-key", Option::ArgStr, + "File holding PEM-encoded server key. Ignored unless " + "--grpc-use-ssl is true."}); + grpc_options_.push_back( + {OPTION_GRPC_ROOT_CERT, "grpc-root-cert", Option::ArgStr, + "File holding PEM-encoded root certificate. Ignore unless " + "--grpc-use-ssl is false."}); + grpc_options_.push_back( + {OPTION_GRPC_RESPONSE_COMPRESSION_LEVEL, + "grpc-infer-response-compression-level", Option::ArgStr, + "The compression level to be used while returning the infer response to " + "the peer. Allowed values are none, low, medium and high. By default, " + "compression level is selected as none."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_KEEPALIVE_TIME_MS, "grpc-keepalive-time", Option::ArgInt, + "The period (in milliseconds) after which a keepalive ping is sent on " + "the transport. Default is 7200000 (2 hours)."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_KEEPALIVE_TIMEOUT_MS, "grpc-keepalive-timeout", + Option::ArgInt, + "The period (in milliseconds) the sender of the keepalive ping waits " + "for an acknowledgement. If it does not receive an acknowledgment " + "within this time, it will close the connection. 
" + "Default is 20000 (20 seconds)."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, + "grpc-keepalive-permit-without-calls", Option::ArgBool, + "Allows keepalive pings to be sent even if there are no calls in flight " + "(0 : false; 1 : true). Default is 0 (false)."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA, + "grpc-http2-max-pings-without-data", Option::ArgInt, + "The maximum number of pings that can be sent when there is no " + "data/header frame to be sent. gRPC Core will not continue sending " + "pings if we run over the limit. Setting it to 0 allows sending pings " + "without such a restriction. Default is 2."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS, + "grpc-http2-min-recv-ping-interval-without-data", Option::ArgInt, + "If there are no data/header frames being sent on the transport, this " + "channel argument on the server side controls the minimum time " + "(in milliseconds) that gRPC Core would expect between receiving " + "successive pings. If the time between successive pings is less than " + "this time, then the ping will be considered a bad ping from the peer. " + "Such a ping counts as a ‘ping strike’. Default is 300000 (5 " + "minutes)."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_HTTP2_MAX_PING_STRIKES, "grpc-http2-max-ping-strikes", + Option::ArgInt, + "Maximum number of bad pings that the server will tolerate before " + "sending an HTTP2 GOAWAY frame and closing the transport. Setting it to " + "0 allows the server to accept any number of bad pings. Default is 2."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_MAX_CONNECTION_AGE_MS, "grpc-max-connection-age", + Option::ArgInt, + "Maximum time that a channel may exist in milliseconds." + "Default is undefined."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_MAX_CONNECTION_AGE_GRACE_MS, + "grpc-max-connection-age-grace", Option::ArgInt, + "Grace period after the channel reaches its max age. " + "Default is undefined."}); + grpc_options_.push_back( + {OPTION_GRPC_RESTRICTED_PROTOCOL, "grpc-restricted-protocol", + ":=", + "Specify restricted GRPC protocol setting. The format of this " + "flag is --grpc-restricted-protocol=,=. Where " + " is a comma-separated list of protocols to be restricted. " + " will be additional header key to be checked when a GRPC request " + "is received, and is the value expected to be matched." + " Allowed protocols: " + + Join(RESTRICTED_CATEGORY_NAMES, ", ")}); +#endif // TRITON_ENABLE_GRPC + +#ifdef TRITON_ENABLE_LOGGING + logging_options_.push_back( + {OPTION_LOG_VERBOSE, "log-verbose", Option::ArgInt, + "Set verbose logging level. Zero (0) disables verbose logging and " + "values >= 1 enable verbose logging."}); + logging_options_.push_back( + {OPTION_LOG_INFO, "log-info", Option::ArgBool, + "Enable/disable info-level logging."}); + logging_options_.push_back( + {OPTION_LOG_WARNING, "log-warning", Option::ArgBool, + "Enable/disable warning-level logging."}); + logging_options_.push_back( + {OPTION_LOG_ERROR, "log-error", Option::ArgBool, + "Enable/disable error-level logging."}); + logging_options_.push_back( + {OPTION_LOG_FORMAT, "log-format", Option::ArgStr, + "Set the logging format. Options are \"default\" and \"ISO8601\". " + "The default is \"default\". For \"default\", the log severity (L) and " + "timestamp will be logged as \"LMMDD hh:mm:ss.ssssss\". 
" + "For \"ISO8601\", the log format will be \"YYYY-MM-DDThh:mm:ssZ L\"."}); + logging_options_.push_back( + {OPTION_LOG_FILE, "log-file", Option::ArgStr, + "Set the name of the log output file. If specified, log outputs will be " + "saved to this file. If not specified, log outputs will stream to the " + "console."}); +#endif // TRITON_ENABLE_LOGGING + +#if defined(TRITON_ENABLE_SAGEMAKER) + sagemaker_options_.push_back( + {OPTION_ALLOW_SAGEMAKER, "allow-sagemaker", Option::ArgBool, + "Allow the server to listen for Sagemaker requests. Default is false."}); + sagemaker_options_.push_back( + {OPTION_SAGEMAKER_PORT, "sagemaker-port", Option::ArgInt, + "The port for the server to listen on for Sagemaker requests. Default " + "is 8080."}); + sagemaker_options_.push_back( + {OPTION_SAGEMAKER_SAFE_PORT_RANGE, "sagemaker-safe-port-range", + "-", + "Set the allowed port range for endpoints other than the SageMaker " + "endpoints."}); + sagemaker_options_.push_back( + {OPTION_SAGEMAKER_THREAD_COUNT, "sagemaker-thread-count", Option::ArgInt, + "Number of threads handling Sagemaker requests. Default is 8."}); +#endif // TRITON_ENABLE_SAGEMAKER + +#if defined(TRITON_ENABLE_VERTEX_AI) + vertex_options_.push_back( + {OPTION_ALLOW_VERTEX_AI, "allow-vertex-ai", Option::ArgBool, + "Allow the server to listen for Vertex AI requests. Default is true if " + "AIP_MODE=PREDICTION, false otherwise."}); + vertex_options_.push_back( + {OPTION_VERTEX_AI_PORT, "vertex-ai-port", Option::ArgInt, + "The port for the server to listen on for Vertex AI requests. Default " + "is AIP_HTTP_PORT if set, 8080 otherwise."}); + vertex_options_.push_back( + {OPTION_VERTEX_AI_THREAD_COUNT, "vertex-ai-thread-count", Option::ArgInt, + "Number of threads handling Vertex AI requests. Default is 8."}); + vertex_options_.push_back( + {OPTION_VERTEX_AI_DEFAULT_MODEL, "vertex-ai-default-model", + Option::ArgStr, + "The name of the model to use for single-model inference requests."}); +#endif // TRITON_ENABLE_VERTEX_AI + +#if defined(TRITON_ENABLE_METRICS) + metric_options_.push_back( + {OPTION_ALLOW_METRICS, "allow-metrics", Option::ArgBool, + "Allow the server to provide prometheus metrics."}); + metric_options_.push_back( + {OPTION_ALLOW_GPU_METRICS, "allow-gpu-metrics", Option::ArgBool, + "Allow the server to provide GPU metrics. Ignored unless " + "--allow-metrics is true."}); + metric_options_.push_back( + {OPTION_ALLOW_CPU_METRICS, "allow-cpu-metrics", Option::ArgBool, + "Allow the server to provide CPU metrics. Ignored unless " + "--allow-metrics is true."}); + metric_options_.push_back( + {OPTION_METRICS_ADDRESS, "metrics-address", Option::ArgStr, + "The address for the metrics server to bind to. Default is the same as " + "--http-address if built with HTTP support. Otherwise, default is " + "0.0.0.0"}); + metric_options_.push_back( + {OPTION_METRICS_PORT, "metrics-port", Option::ArgInt, + "The port reporting prometheus metrics. Default is 8002."}); + metric_options_.push_back( + {OPTION_METRICS_INTERVAL_MS, "metrics-interval-ms", Option::ArgFloat, + "Metrics will be collected once every " + "milliseconds. Default is 2000 milliseconds."}); + metric_options_.push_back( + {OPTION_METRICS_CONFIG, "metrics-config", "=", + "Specify a metrics-specific configuration setting. The format of this " + "flag is --metrics-config==. 
It can be specified " + "multiple times."}); +#endif // TRITON_ENABLE_METRICS + +#ifdef TRITON_ENABLE_TRACING + tracing_options_.push_back( + {OPTION_TRACE_CONFIG, "trace-config", ",=", + "Specify global or trace mode specific configuration setting. " + "The format of this flag is --trace-config " + ",=. " + "Where is either \"triton\" or \"opentelemetry\". " + "The default is \"triton\". To specify global trace settings " + "(level, rate, count, or mode), the format would be " + "--trace-config =. For \"triton\" mode, the server will " + "use " + "Triton's Trace APIs. For \"opentelemetry\" mode, the server will use " + "OpenTelemetry's APIs to generate, collect and export traces for " + "individual inference requests."}); +#endif // TRITON_ENABLE_TRACING + + cache_options_.push_back( + {OPTION_CACHE_CONFIG, "cache-config", ",=", + "Specify a cache-specific configuration setting. The format of this " + "flag is --cache-config=,=. Where " + " is the name of the cache, such as 'local' or 'redis'. " + "Example: --cache-config=local,size=1048576 will configure a 'local' " + "cache implementation with a fixed buffer pool of size 1048576 bytes."}); + cache_options_.push_back( + {OPTION_CACHE_DIR, "cache-directory", Option::ArgStr, + "The global directory searched for cache shared libraries. Default is " + "'/opt/tritonserver/caches'. This directory is expected to contain a " + "cache implementation as a shared library with the name " + "'libtritoncache.so'."}); + + + rate_limiter_options_.push_back( + // FIXME: fix the default to execution_count once RL logic is complete. + {OPTION_RATE_LIMIT, "rate-limit", Option::ArgStr, + "Specify the mode for rate limiting. Options are \"execution_count\" " + "and \"off\". The default is \"off\". For " + "\"execution_count\", the server will determine the instance using " + "configured priority and the number of time the instance has been " + "used to run inference. The inference will finally be executed once " + "the required resources are available. For \"off\", the server will " + "ignore any rate limiter config and run inference as soon as an " + "instance is ready."}); + rate_limiter_options_.push_back( + {OPTION_RATE_LIMIT_RESOURCE, "rate-limit-resource", + "::", + "The number of resources available to the server. The format of this " + "flag is --rate-limit-resource=::. The " + " is optional and if not listed will be applied to every " + "device. If the resource is specified as \"GLOBAL\" in the model " + "configuration the resource is considered shared among all the devices " + "in the system. The property is ignored for such resources. " + "This flag can be specified multiple times to specify each resources " + "and their availability. By default, the max across all instances that " + "list the resource is selected as its availability. The values for this " + "flag is case-insensitive."}); + + memory_device_options_.push_back( + {OPTION_PINNED_MEMORY_POOL_BYTE_SIZE, "pinned-memory-pool-byte-size", + Option::ArgInt, + "The total byte size that can be allocated as pinned system memory. " + "If GPU support is enabled, the server will allocate pinned system " + "memory to accelerate data transfer between host and devices until it " + "exceeds the specified byte size. If 'numa-node' is configured via " + "--host-policy, the pinned system memory of the pool size will be " + "allocated on each numa node. This option will not affect the " + "allocation conducted by the backend frameworks. 
Default is 256 MB."}); + memory_device_options_.push_back( + {OPTION_CUDA_MEMORY_POOL_BYTE_SIZE, "cuda-memory-pool-byte-size", + ":", + "The total byte size that can be allocated as CUDA memory for the GPU " + "device. If GPU support is enabled, the server will allocate CUDA " + "memory to minimize data transfer between host and devices until it " + "exceeds the specified byte size. This option will not affect the " + "allocation conducted by the backend frameworks. The argument should be " + "2 integers separated by colons in the format " + ":. This option can be used multiple " + "times, but only once per GPU device. Subsequent uses will overwrite " + "previous uses for the same GPU device. Default is 64 MB."}); + memory_device_options_.push_back( + {OPTION_CUDA_VIRTUAL_ADDRESS_SIZE, "cuda-virtual-address-size", + ":", + "The total CUDA virtual address size that will be used for each " + "implicit state when growable memory is used. This value determines " + "the maximum size of each implicit state. The state size cannot go " + "beyond this value. The argument should be " + "2 integers separated by colons in the format " + ":. This option can be used " + "multiple " + "times, but only once per GPU device. Subsequent uses will overwrite " + "previous uses for the same GPU device. Default is 1 GB."}); + memory_device_options_.push_back( + {OPTION_MIN_SUPPORTED_COMPUTE_CAPABILITY, + "min-supported-compute-capability", Option::ArgFloat, + "The minimum supported CUDA compute capability. GPUs that don't support " + "this compute capability will not be used by the server."}); + memory_device_options_.push_back( + {OPTION_BUFFER_MANAGER_THREAD_COUNT, "buffer-manager-thread-count", + Option::ArgInt, + "The number of threads used to accelerate copies and other operations " + "required to manage input and output tensor contents. Default is 0."}); + memory_device_options_.push_back( + {OPTION_HOST_POLICY, "host-policy", ",=", + "Specify a host policy setting associated with a policy name. The " + "format of this flag is --host-policy=,=. " + "Currently supported settings are 'numa-node', 'cpu-cores'. Note that " + "'numa-node' setting will affect pinned memory pool behavior, see " + "--pinned-memory-pool for more detail."}); + memory_device_options_.push_back( + {OPTION_MODEL_LOAD_GPU_LIMIT, "model-load-gpu-limit", + ":", + "Specify the limit on GPU memory usage as a fraction. If model loading " + "on the device is requested and the current memory usage exceeds the " + "limit, the load will be rejected. If not specified, the limit will " + "not be set."}); + + backend_options_.push_back( + {OPTION_BACKEND_DIR, "backend-directory", Option::ArgStr, + "The global directory searched for backend shared libraries. Default is " + "'/opt/tritonserver/backends'."}); + backend_options_.push_back( + {OPTION_BACKEND_CONFIG, "backend-config", ",=", + "Specify a backend-specific configuration setting. The format of this " + "flag is --backend-config=,=. Where " + " is the name of the backend, such as 'tensorrt'."}); + + repo_agent_options_.push_back( + {OPTION_REPOAGENT_DIR, "repoagent-directory", Option::ArgStr, + "The global directory searched for repository agent shared libraries. " + "Default is '/opt/tritonserver/repoagents'."}); + + // Deprecations + deprecated_options_.push_back( + {OPTION_STRICT_MODEL_CONFIG, "strict-model-config", Option::ArgBool, + "DEPRECATED: If true model configuration files must be provided and all " + "required " + "configuration settings must be specified. 
If false the model " + "configuration may be absent or only partially specified and the " + "server will attempt to derive the missing required configuration."}); + deprecated_options_.push_back( + {OPTION_RESPONSE_CACHE_BYTE_SIZE, "response-cache-byte-size", + Option::ArgInt, "DEPRECATED: Please use --cache-config instead."}); +#ifdef TRITON_ENABLE_TRACING + deprecated_options_.push_back( + {OPTION_TRACE_FILEPATH, "trace-file", Option::ArgStr, + "DEPRECATED: Please use --trace-config triton,file=" + " Set the file where trace output will be saved. If " + "--trace-log-frequency" + " is also specified, this argument value will be the prefix of the files" + " to save the trace output. See --trace-log-frequency for detail."}); + deprecated_options_.push_back( + {OPTION_TRACE_LEVEL, "trace-level", Option::ArgStr, + "DEPRECATED: Please use --trace-config level=" + "Specify a trace level. OFF to disable tracing, TIMESTAMPS to " + "trace timestamps, TENSORS to trace tensors. It may be specified " + "multiple times to trace multiple information. Default is OFF."}); + deprecated_options_.push_back( + {OPTION_TRACE_RATE, "trace-rate", Option::ArgInt, + "DEPRECATED: Please use --trace-config rate=" + "Set the trace sampling rate. Default is 1000."}); + deprecated_options_.push_back( + {OPTION_TRACE_COUNT, "trace-count", Option::ArgInt, + "DEPRECATED: Please use --trace-config count=" + "Set the number of traces to be sampled. If the value is -1, the number " + "of traces to be sampled will not be limited. Default is -1."}); + deprecated_options_.push_back( + {OPTION_TRACE_LOG_FREQUENCY, "trace-log-frequency", Option::ArgInt, + "DEPRECATED: Please use --trace-config triton,log-frequency=" + "Set the trace log frequency. If the value is 0, Triton will only log " + "the trace output to when shutting down. Otherwise, Triton " + "will log the trace output to . when it collects the " + "specified number of traces. For example, if the log frequency is 100, " + "when Triton collects the 100-th trace, it logs the traces to file " + ".0, and when it collects the 200-th trace, it logs the " + "101-th to the 200-th traces to file .1. 
Default is 0."}); +#endif // TRITON_ENABLE_TRACING +} + +void +TritonParser::SetupOptionGroups() +{ + SetupOptions(); + option_groups_.emplace_back(GLOBAL_OPTION_GROUP, global_options_); + option_groups_.emplace_back("Server", server_options_); + option_groups_.emplace_back("Logging", logging_options_); + option_groups_.emplace_back("Model Repository", model_repo_options_); + option_groups_.emplace_back("HTTP", http_options_); + option_groups_.emplace_back("GRPC", grpc_options_); + option_groups_.emplace_back("Sagemaker", sagemaker_options_); + option_groups_.emplace_back("Vertex", vertex_options_); + option_groups_.emplace_back("Metrics", metric_options_); + option_groups_.emplace_back("Tracing", tracing_options_); + option_groups_.emplace_back("Backend", backend_options_); + option_groups_.emplace_back("Repository Agent", repo_agent_options_); + option_groups_.emplace_back("Response Cache", cache_options_); + option_groups_.emplace_back("Rate Limiter", rate_limiter_options_); + option_groups_.emplace_back( + "Memory/Device Management", memory_device_options_); + option_groups_.emplace_back("DEPRECATED", deprecated_options_); +} + +TritonParser::TritonParser() +{ + SetupOptionGroups(); +} + +void +TritonServerParameters::CheckPortCollision() +{ + // [FIXME] try to make this function endpoint type agnostic + // List of enabled services and their constraints + std::vector< + std::tuple> + ports; +#ifdef TRITON_ENABLE_HTTP + if (allow_http_) { + ports.emplace_back("HTTP", http_address_, http_port_, false, -1, -1); + } +#endif // TRITON_ENABLE_HTTP +#ifdef TRITON_ENABLE_GRPC + if (allow_grpc_) { + ports.emplace_back( + "GRPC", grpc_options_.socket_.address_, grpc_options_.socket_.port_, + false, -1, -1); + } +#endif // TRITON_ENABLE_GRPC +#ifdef TRITON_ENABLE_METRICS + if (allow_metrics_) { + ports.emplace_back( + "metrics", metrics_address_, metrics_port_, false, -1, -1); + } +#endif // TRITON_ENABLE_METRICS +#ifdef TRITON_ENABLE_SAGEMAKER + if (allow_sagemaker_) { + ports.emplace_back( + "SageMaker", sagemaker_address_, sagemaker_port_, + sagemaker_safe_range_set_, sagemaker_safe_range_.first, + sagemaker_safe_range_.second); + } +#endif // TRITON_ENABLE_SAGEMAKER +#ifdef TRITON_ENABLE_VERTEX_AI + if (allow_vertex_ai_) { + ports.emplace_back( + "Vertex AI", vertex_ai_address_, vertex_ai_port_, false, -1, -1); + } +#endif // TRITON_ENABLE_VERTEX_AI + + for (auto curr_it = ports.begin(); curr_it != ports.end(); ++curr_it) { + // If the current service doesn't specify the allow port range for other + // services, then we don't need to revisit the checked services + auto comparing_it = (std::get<3>(*curr_it)) ? 
ports.begin() : (curr_it + 1); + for (; comparing_it != ports.end(); ++comparing_it) { + if (comparing_it == curr_it) { + continue; + } + if (std::get<1>(*curr_it) != std::get<1>(*comparing_it)) { + continue; + } + // Set range and comparing service port is out of range + if (std::get<3>(*curr_it) && + ((std::get<2>(*comparing_it) < std::get<4>(*curr_it)) || + (std::get<2>(*comparing_it) > std::get<5>(*curr_it)))) { + std::stringstream ss; + ss << "The server cannot listen to " << std::get<0>(*comparing_it) + << " requests at port " << std::get<2>(*comparing_it) + << ", allowed port range is [" << std::get<4>(*curr_it) << ", " + << std::get<5>(*curr_it) << "]" << std::endl; + throw ParseException(ss.str()); + } + if (std::get<2>(*curr_it) == std::get<2>(*comparing_it)) { + std::stringstream ss; + ss << "The server cannot listen to " << std::get<0>(*curr_it) + << " requests " + << "and " << std::get<0>(*comparing_it) + << " requests at the same address and port " << std::get<1>(*curr_it) + << ":" << std::get<2>(*curr_it) << std::endl; + throw ParseException(ss.str()); + } + } + } +} + +TritonServerParameters::ManagedTritonServerOptionPtr +TritonServerParameters::BuildTritonServerOptions() +{ + TRITONSERVER_ServerOptions* loptions = nullptr; + THROW_IF_ERR( + ParseException, TRITONSERVER_ServerOptionsNew(&loptions), + "creating server options"); + ManagedTritonServerOptionPtr managed_ptr( + loptions, TRITONSERVER_ServerOptionsDelete); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetServerId(loptions, server_id_.c_str()), + "setting server ID"); + for (const auto& model_repository_path : model_repository_paths_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetModelRepositoryPath( + loptions, model_repository_path.c_str()), + "setting model repository path"); + } + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetModelControlMode(loptions, control_mode_), + "setting model control mode"); + for (const auto& model : startup_models_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetStartupModel(loptions, model.c_str()), + "setting startup model"); + } + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetRateLimiterMode(loptions, rate_limit_mode_), + "setting rate limiter configuration"); + for (const auto& resource : rate_limit_resources_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsAddRateLimiterResource( + loptions, std::get<0>(resource).c_str(), std::get<1>(resource), + std::get<2>(resource)), + "setting rate limiter resource"); + } + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetPinnedMemoryPoolByteSize( + loptions, pinned_memory_pool_byte_size_), + "setting total pinned memory byte size"); + for (const auto& cuda_pool : cuda_pools_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetCudaMemoryPoolByteSize( + loptions, cuda_pool.first, cuda_pool.second), + "setting total CUDA memory byte size"); + } + for (const auto& cuda_virtual_address_size : cuda_virtual_address_size_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetCudaVirtualAddressSize( + loptions, cuda_virtual_address_size.first, + cuda_virtual_address_size.second), + "setting total CUDA virtual address size"); + } + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetMinSupportedComputeCapability( + loptions, min_supported_compute_capability_), + "setting minimum supported CUDA compute capability"); + THROW_IF_ERR( + ParseException, + 
TRITONSERVER_ServerOptionsSetExitOnError(loptions, exit_on_error_), + "setting exit on error"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetStrictModelConfig( + loptions, strict_model_config_), + "setting strict model configuration"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetStrictReadiness(loptions, strict_readiness_), + "setting strict readiness"); + // [FIXME] std::max seems to be part of Parse() + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetExitTimeout( + loptions, std::max(0, exit_timeout_secs_)), + "setting exit timeout"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetBufferManagerThreadCount( + loptions, std::max(0, buffer_manager_thread_count_)), + "setting buffer manager thread count"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetModelLoadThreadCount( + loptions, std::max(1u, model_load_thread_count_)), + "setting model load thread count"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetModelNamespacing( + loptions, enable_model_namespacing_), + "setting model namespacing"); + +#ifdef TRITON_ENABLE_LOGGING + TRITONSERVER_ServerOptionsSetLogFile(loptions, log_file_.c_str()); + THROW_IF_ERR( + ParseException, TRITONSERVER_ServerOptionsSetLogInfo(loptions, log_info_), + "setting log info enable"); + THROW_IF_ERR( + ParseException, TRITONSERVER_ServerOptionsSetLogWarn(loptions, log_warn_), + "setting log warn enable"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetLogError(loptions, log_error_), + "setting log error enable"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetLogVerbose(loptions, log_verbose_), + "setting log verbose level"); + switch (log_format_) { + case triton::common::Logger::Format::kDEFAULT: + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetLogFormat( + loptions, TRITONSERVER_LOG_DEFAULT), + "setting log format"); + break; + case triton::common::Logger::Format::kISO8601: + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetLogFormat( + loptions, TRITONSERVER_LOG_ISO8601), + "setting log format"); + break; + } +#endif // TRITON_ENABLE_LOGGING + +#ifdef TRITON_ENABLE_METRICS + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetMetrics(loptions, allow_metrics_), + "setting metrics enable"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetGpuMetrics(loptions, allow_gpu_metrics_), + "setting GPU metrics enable"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetCpuMetrics(loptions, allow_cpu_metrics_), + "setting CPU metrics enable"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetMetricsInterval( + loptions, metrics_interval_ms_), + "setting metrics interval"); + for (const auto& mcs : metrics_config_settings_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetMetricsConfig( + loptions, std::get<0>(mcs).c_str(), std::get<1>(mcs).c_str(), + std::get<2>(mcs).c_str()), + "setting metrics configuration"); + } + +#endif // TRITON_ENABLE_METRICS + + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetBackendDirectory( + loptions, backend_dir_.c_str()), + "setting backend directory"); + + // Enable cache and configure it if a cache CLI arg is passed, + // this will allow for an empty configuration. 
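The cache block that follows hands TRITONSERVER_ServerOptionsSetCacheConfig a JSON string built by PairsToJsonStr. A minimal stand-in, not part of the patch, showing the shape of that string for the documented example --cache-config=local,size=1048576; SettingsToJson is hypothetical, and the exact formatting is up to the triton_json helper used above.

    // Illustrative only: builds the JSON object that represents one cache's
    // parsed key/value settings, e.g. {"size":"1048576"} for cache 'local'.
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    std::string
    SettingsToJson(const std::vector<std::pair<std::string, std::string>>& settings)
    {
      std::string json = "{";
      for (size_t i = 0; i < settings.size(); ++i) {
        if (i != 0) {
          json += ",";
        }
        json += "\"" + settings[i].first + "\":\"" + settings[i].second + "\"";
      }
      return json + "}";
    }

    int
    main()
    {
      // Parsed from: --cache-config=local,size=1048576
      std::vector<std::pair<std::string, std::string>> local_settings = {
          {"size", "1048576"}};
      std::cout << "cache 'local' config: " << SettingsToJson(local_settings)
                << std::endl;
      return 0;
    }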
+ if (enable_cache_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetCacheDirectory( + loptions, cache_dir_.c_str()), + "setting cache directory"); + + for (const auto& cache_pair : cache_config_settings_) { + const auto& cache_name = cache_pair.first; + const auto& settings = cache_pair.second; + const auto& json_config_str = PairsToJsonStr(settings); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetCacheConfig( + loptions, cache_name.c_str(), json_config_str.c_str()), + "setting cache configuration"); + } + } + + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetRepoAgentDirectory( + loptions, repoagent_dir_.c_str()), + "setting repository agent directory"); + for (const auto& bcs : backend_config_settings_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetBackendConfig( + loptions, std::get<0>(bcs).c_str(), std::get<1>(bcs).c_str(), + std::get<2>(bcs).c_str()), + "setting backend configuration"); + } + for (const auto& limit : load_gpu_limit_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetModelLoadDeviceLimit( + loptions, TRITONSERVER_INSTANCEGROUPKIND_GPU, limit.first, + limit.second), + "setting model load GPU limit"); + } + for (const auto& hp : host_policies_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetHostPolicy( + loptions, std::get<0>(hp).c_str(), std::get<1>(hp).c_str(), + std::get<2>(hp).c_str()), + "setting host policy"); + } + return managed_ptr; +} + +std::pair> +TritonParser::Parse(int argc, char** argv) +{ + // + // Step 1. Before parsing setup + // + TritonServerParameters lparams; + bool strict_model_config_present{false}; + bool disable_auto_complete_config{false}; + bool cache_size_present{false}; + bool cache_config_present{false}; +#ifdef TRITON_ENABLE_TRACING + bool explicit_disable_trace{false}; + bool trace_filepath_present{false}; + bool trace_level_present{false}; + bool trace_rate_present{false}; + bool trace_count_present{false}; + bool trace_log_frequency_present{false}; +#endif // TRITON_ENABLE_TRACING + int option_index = 0; + +#ifdef TRITON_ENABLE_GRPC + triton::server::grpc::Options& lgrpc_options = lparams.grpc_options_; +#endif // TRITON_ENABLE_GRPC + +#ifdef TRITON_ENABLE_VERTEX_AI + // Set different default value if specific flag is set + { + auto aip_mode = + triton::server::GetEnvironmentVariableOrDefault("AIP_MODE", ""); + // Enable Vertex AI service and disable HTTP / GRPC service by default + // if detecting Vertex AI environment + if (aip_mode == "PREDICTION") { + lparams.allow_vertex_ai_ = true; +#ifdef TRITON_ENABLE_HTTP + lparams.allow_http_ = false; +#endif // TRITON_ENABLE_HTTP +#ifdef TRITON_ENABLE_GRPC + lparams.allow_grpc_ = false; +#endif // TRITON_ENABLE_GRPC + } + auto port = triton::server::GetEnvironmentVariableOrDefault( + "AIP_HTTP_PORT", "8080"); + lparams.vertex_ai_port_ = ParseOption(port); + } +#endif // TRITON_ENABLE_VERTEX_AI + + // + // Step 2. parse options + // + std::vector long_options; + for (const auto& group : option_groups_) { + for (const auto& o : group.second) { + long_options.push_back(o.GetLongOption()); + } + } + long_options.push_back({nullptr, 0, nullptr, 0}); + + int flag; + while ((flag = getopt_long( + argc, argv, "", &long_options[0], &option_index)) != -1) { + try { + switch (flag) { + case OPTION_HELP: + // [FIXME] how help is printed? 
+ case '?': + // [FIXME] fall through when seeing this, currently consumes all + // options [FIXME] disable stderr output of `getopt_long` + throw ParseException(); +#ifdef TRITON_ENABLE_LOGGING + case OPTION_LOG_VERBOSE: + lparams.log_verbose_ = ParseIntBoolOption(optarg); + break; + case OPTION_LOG_INFO: + lparams.log_info_ = ParseOption(optarg); + break; + case OPTION_LOG_WARNING: + lparams.log_warn_ = ParseOption(optarg); + break; + case OPTION_LOG_ERROR: + lparams.log_error_ = ParseOption(optarg); + break; + case OPTION_LOG_FORMAT: { + std::string format_str(optarg); + if (format_str == "default") { + lparams.log_format_ = triton::common::Logger::Format::kDEFAULT; + } else if (format_str == "ISO8601") { + lparams.log_format_ = triton::common::Logger::Format::kISO8601; + } else { + throw ParseException("invalid argument for --log-format"); + } + break; + } + case OPTION_LOG_FILE: + lparams.log_file_ = optarg; + break; +#endif // TRITON_ENABLE_LOGGING + + case OPTION_ID: + lparams.server_id_ = optarg; + break; + case OPTION_MODEL_REPOSITORY: + lparams.model_repository_paths_.insert(optarg); + break; + case OPTION_EXIT_ON_ERROR: + lparams.exit_on_error_ = ParseOption(optarg); + break; + case OPTION_DISABLE_AUTO_COMPLETE_CONFIG: + disable_auto_complete_config = true; + break; + case OPTION_STRICT_MODEL_CONFIG: + std::cerr << "Warning: '--strict-model-config' has been deprecated! " + "Please use '--disable-auto-complete-config' instead." + << std::endl; + strict_model_config_present = true; + lparams.strict_model_config_ = ParseOption(optarg); + break; + case OPTION_STRICT_READINESS: + lparams.strict_readiness_ = ParseOption(optarg); + break; + +#ifdef TRITON_ENABLE_HTTP + case OPTION_ALLOW_HTTP: + lparams.allow_http_ = ParseOption(optarg); + break; + case OPTION_HTTP_PORT: + lparams.http_port_ = ParseOption(optarg); + break; + case OPTION_REUSE_HTTP_PORT: + lparams.reuse_http_port_ = ParseOption(optarg); + break; + case OPTION_HTTP_ADDRESS: + lparams.http_address_ = optarg; + break; + case OPTION_HTTP_HEADER_FORWARD_PATTERN: + lparams.http_forward_header_pattern_ = optarg; + break; + break; + case OPTION_HTTP_THREAD_COUNT: + lparams.http_thread_cnt_ = ParseOption(optarg); + break; + case OPTION_HTTP_RESTRICTED_API: + ParseRestrictedFeatureOption( + optarg, long_options[option_index].name, "", "api", + lparams.http_restricted_apis_); + break; + +#endif // TRITON_ENABLE_HTTP + +#ifdef TRITON_ENABLE_SAGEMAKER + case OPTION_ALLOW_SAGEMAKER: + lparams.allow_sagemaker_ = ParseOption(optarg); + break; + case OPTION_SAGEMAKER_PORT: + lparams.sagemaker_port_ = ParseOption(optarg); + break; + case OPTION_SAGEMAKER_SAFE_PORT_RANGE: + lparams.sagemaker_safe_range_set_ = true; + lparams.sagemaker_safe_range_ = + ParsePairOption(optarg, "-"); + break; + case OPTION_SAGEMAKER_THREAD_COUNT: + lparams.sagemaker_thread_cnt_ = ParseOption(optarg); + break; +#endif // TRITON_ENABLE_SAGEMAKER + +#ifdef TRITON_ENABLE_VERTEX_AI + case OPTION_ALLOW_VERTEX_AI: + lparams.allow_vertex_ai_ = ParseOption(optarg); + break; + case OPTION_VERTEX_AI_PORT: + lparams.vertex_ai_port_ = ParseOption(optarg); + break; + case OPTION_VERTEX_AI_THREAD_COUNT: + lparams.vertex_ai_thread_cnt_ = ParseOption(optarg); + break; + case OPTION_VERTEX_AI_DEFAULT_MODEL: + lparams.vertex_ai_default_model_ = optarg; + break; +#endif // TRITON_ENABLE_VERTEX_AI + +#ifdef TRITON_ENABLE_GRPC + case OPTION_ALLOW_GRPC: + lparams.allow_grpc_ = ParseOption(optarg); + break; + case OPTION_GRPC_PORT: + lgrpc_options.socket_.port_ = ParseOption(optarg); 
+ break; + case OPTION_REUSE_GRPC_PORT: + lgrpc_options.socket_.reuse_port_ = ParseOption(optarg); + break; + case OPTION_GRPC_ADDRESS: + lgrpc_options.socket_.address_ = optarg; + break; + case OPTION_GRPC_INFER_ALLOCATION_POOL_SIZE: + lgrpc_options.infer_allocation_pool_size_ = ParseOption(optarg); + break; + case OPTION_GRPC_USE_SSL: + lgrpc_options.ssl_.use_ssl_ = ParseOption(optarg); + break; + case OPTION_GRPC_USE_SSL_MUTUAL: + lgrpc_options.ssl_.use_mutual_auth_ = ParseOption(optarg); + lgrpc_options.ssl_.use_ssl_ = true; + break; + case OPTION_GRPC_SERVER_CERT: + lgrpc_options.ssl_.server_cert_ = optarg; + break; + case OPTION_GRPC_SERVER_KEY: + lgrpc_options.ssl_.server_key_ = optarg; + break; + case OPTION_GRPC_ROOT_CERT: + lgrpc_options.ssl_.root_cert_ = optarg; + break; + case OPTION_GRPC_RESPONSE_COMPRESSION_LEVEL: { + std::string mode_str(optarg); + std::transform( + mode_str.begin(), mode_str.end(), mode_str.begin(), ::tolower); + if (mode_str == "none") { + lgrpc_options.infer_compression_level_ = GRPC_COMPRESS_LEVEL_NONE; + } else if (mode_str == "low") { + lgrpc_options.infer_compression_level_ = GRPC_COMPRESS_LEVEL_LOW; + } else if (mode_str == "medium") { + lgrpc_options.infer_compression_level_ = GRPC_COMPRESS_LEVEL_MED; + } else if (mode_str == "high") { + lgrpc_options.infer_compression_level_ = GRPC_COMPRESS_LEVEL_HIGH; + } else { + throw ParseException( + "invalid argument for " + "--grpc_infer_response_compression_level"); + } + break; + } + case OPTION_GRPC_ARG_KEEPALIVE_TIME_MS: + lgrpc_options.keep_alive_.keepalive_time_ms_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_KEEPALIVE_TIMEOUT_MS: + lgrpc_options.keep_alive_.keepalive_timeout_ms_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS: + lgrpc_options.keep_alive_.keepalive_permit_without_calls_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA: + lgrpc_options.keep_alive_.http2_max_pings_without_data_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS: + lgrpc_options.keep_alive_ + .http2_min_recv_ping_interval_without_data_ms_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_HTTP2_MAX_PING_STRIKES: + lgrpc_options.keep_alive_.http2_max_ping_strikes_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_MAX_CONNECTION_AGE_MS: + lgrpc_options.keep_alive_.max_connection_age_ms_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_MAX_CONNECTION_AGE_GRACE_MS: + lgrpc_options.keep_alive_.max_connection_age_grace_ms_ = + ParseOption(optarg); + break; + case OPTION_GRPC_RESTRICTED_PROTOCOL: { + ParseRestrictedFeatureOption( + optarg, long_options[option_index].name, + std::string( + triton::server::grpc::kRestrictedProtocolHeaderTemplate), + "protocol", lgrpc_options.restricted_protocols_); + break; + } + case OPTION_GRPC_HEADER_FORWARD_PATTERN: + lgrpc_options.forward_header_pattern_ = optarg; + break; +#endif // TRITON_ENABLE_GRPC + +#ifdef TRITON_ENABLE_METRICS + case OPTION_ALLOW_METRICS: + lparams.allow_metrics_ = ParseOption(optarg); + break; + case OPTION_ALLOW_GPU_METRICS: + lparams.allow_gpu_metrics_ = ParseOption(optarg); + break; + case OPTION_ALLOW_CPU_METRICS: + lparams.allow_cpu_metrics_ = ParseOption(optarg); + break; + case OPTION_METRICS_ADDRESS: + lparams.metrics_address_ = optarg; + break; + case OPTION_METRICS_PORT: + lparams.metrics_port_ = ParseOption(optarg); + break; + case OPTION_METRICS_INTERVAL_MS: + lparams.metrics_interval_ms_ = 
ParseOption(optarg); + break; + case OPTION_METRICS_CONFIG: + lparams.metrics_config_settings_.push_back( + ParseMetricsConfigOption(optarg)); + break; +#endif // TRITON_ENABLE_METRICS + +#ifdef TRITON_ENABLE_TRACING + case OPTION_TRACE_FILEPATH: { + std::cerr << "Warning: '--trace-file' has been deprecated and will be" + " removed in future releases. Please use " + "'--trace-config triton,file= instead." + << std::endl; + trace_filepath_present = true; + lparams.trace_filepath_ = optarg; + break; + } + case OPTION_TRACE_LEVEL: { + std::cerr + << "Warning: '--trace-level' has been deprecated and will be" + " removed in future releases. Please use " + "'--trace-config level= instead." + << std::endl; + trace_level_present = true; + auto parsed_level = ParseTraceLevelOption(optarg); + explicit_disable_trace |= + (parsed_level == TRITONSERVER_TRACE_LEVEL_DISABLED); + lparams.trace_level_ = static_cast( + lparams.trace_level_ | parsed_level); + break; + } + case OPTION_TRACE_RATE: + std::cerr << "Warning: '--trace-rate' has been deprecated and will be" + " removed in future releases. Please use " + "'--trace-config rate= instead." + << std::endl; + trace_rate_present = true; + lparams.trace_rate_ = ParseOption(optarg); + break; + + case OPTION_TRACE_COUNT: + std::cerr + << "Warning: '--trace-count' has been deprecated and will be" + " removed in future releases. Please use " + "'--trace-config count= instead." + << std::endl; + trace_count_present = true; + lparams.trace_count_ = ParseOption(optarg); + break; + case OPTION_TRACE_LOG_FREQUENCY: + std::cerr + << "Warning: '--trace-log-frequency' has been deprecated and " + "will be" + " removed in future releases. Please use " + "'--trace-config triton,log-frequency= instead." + << std::endl; + trace_log_frequency_present = true; + lparams.trace_log_frequency_ = ParseOption(optarg); + break; + case OPTION_TRACE_CONFIG: { + auto trace_config_setting = ParseTraceConfigOption(optarg); + triton::server::TraceConfig& tc = + lparams + .trace_config_map_[std::get<0>(trace_config_setting).c_str()]; + tc.push_back(std::make_pair( + std::get<1>(trace_config_setting).c_str(), + std::get<2>(trace_config_setting).c_str())); + break; + } +#endif // TRITON_ENABLE_TRACING + + case OPTION_POLL_REPO_SECS: + lparams.repository_poll_secs_ = ParseOption(optarg); + break; + case OPTION_STARTUP_MODEL: + lparams.startup_models_.insert(optarg); + break; + case OPTION_MODEL_CONTROL_MODE: { + std::string mode_str(optarg); + std::transform( + mode_str.begin(), mode_str.end(), mode_str.begin(), ::tolower); + if (mode_str == "none") { + lparams.control_mode_ = TRITONSERVER_MODEL_CONTROL_NONE; + } else if (mode_str == "poll") { + lparams.control_mode_ = TRITONSERVER_MODEL_CONTROL_POLL; + } else if (mode_str == "explicit") { + lparams.control_mode_ = TRITONSERVER_MODEL_CONTROL_EXPLICIT; + } else { + throw ParseException("invalid argument for --model-control-mode"); + } + break; + } + case OPTION_RATE_LIMIT: { + std::string rate_limit_str(optarg); + std::transform( + rate_limit_str.begin(), rate_limit_str.end(), + rate_limit_str.begin(), ::tolower); + if (rate_limit_str == "execution_count") { + lparams.rate_limit_mode_ = TRITONSERVER_RATE_LIMIT_EXEC_COUNT; + } else if (rate_limit_str == "off") { + lparams.rate_limit_mode_ = TRITONSERVER_RATE_LIMIT_OFF; + } else { + throw ParseException("invalid argument for --rate-limit"); + } + break; + } + case OPTION_RATE_LIMIT_RESOURCE: { + std::string rate_limit_resource_str(optarg); + std::transform( + rate_limit_resource_str.begin(), 
rate_limit_resource_str.end(), + rate_limit_resource_str.begin(), ::tolower); + lparams.rate_limit_resources_.push_back( + ParseRateLimiterResourceOption(optarg)); + break; + } + case OPTION_PINNED_MEMORY_POOL_BYTE_SIZE: + lparams.pinned_memory_pool_byte_size_ = ParseOption(optarg); + break; + case OPTION_CUDA_MEMORY_POOL_BYTE_SIZE: + lparams.cuda_pools_.push_back( + ParsePairOption(optarg, ":")); + break; + case OPTION_CUDA_VIRTUAL_ADDRESS_SIZE: + lparams.cuda_virtual_address_size_.push_back( + ParsePairOption(optarg, ":")); + break; + case OPTION_RESPONSE_CACHE_BYTE_SIZE: { + cache_size_present = true; + const auto byte_size = std::to_string(ParseOption(optarg)); + lparams.cache_config_settings_["local"] = {{"size", byte_size}}; + std::cerr + << "Warning: '--response-cache-byte-size' has been deprecated! " + "This will default to the 'local' cache implementation with " + "the provided byte size for its config. Please use " + "'--cache-config' instead. The equivalent " + "--cache-config CLI args would be: " + "'--cache-config=local,size=" + + byte_size + "'" + << std::endl; + break; + } + case OPTION_CACHE_CONFIG: { + cache_config_present = true; + const auto cache_setting = ParseCacheConfigOption(optarg); + const auto& cache_name = std::get<0>(cache_setting); + const auto& key = std::get<1>(cache_setting); + const auto& value = std::get<2>(cache_setting); + lparams.cache_config_settings_[cache_name].push_back({key, value}); + break; + } + case OPTION_CACHE_DIR: + lparams.cache_dir_ = optarg; + break; + case OPTION_MIN_SUPPORTED_COMPUTE_CAPABILITY: + lparams.min_supported_compute_capability_ = + ParseOption(optarg); + break; + case OPTION_EXIT_TIMEOUT_SECS: + lparams.exit_timeout_secs_ = ParseOption(optarg); + break; + case OPTION_BACKEND_DIR: + lparams.backend_dir_ = optarg; + break; + case OPTION_REPOAGENT_DIR: + lparams.repoagent_dir_ = optarg; + break; + case OPTION_BUFFER_MANAGER_THREAD_COUNT: + lparams.buffer_manager_thread_count_ = ParseOption(optarg); + break; + case OPTION_MODEL_LOAD_THREAD_COUNT: + lparams.model_load_thread_count_ = ParseOption(optarg); + break; + case OPTION_BACKEND_CONFIG: + lparams.backend_config_settings_.push_back( + ParseBackendConfigOption(optarg)); + break; + case OPTION_HOST_POLICY: + lparams.host_policies_.push_back(ParseHostPolicyOption(optarg)); + break; + case OPTION_MODEL_LOAD_GPU_LIMIT: + lparams.load_gpu_limit_.emplace( + ParsePairOption(optarg, ":")); + break; + case OPTION_MODEL_NAMESPACING: + lparams.enable_model_namespacing_ = ParseOption(optarg); + break; + } + } + catch (const ParseException& pe) { + if ((pe.what() != NULL) && (strlen(pe.what()) != 0)) { + std::stringstream ss; + ss << "Bad option: \"--" << long_options[option_index].name << "\".\n" + << pe.what() << std::endl; + throw ParseException(ss.str()); + } else { + // In case of `Unrecognized option` or `Help` option, just throw a + // ParseException + throw ParseException(); + } + } + } + + if (optind < argc) { + throw ParseException(std::string("Unexpected argument: ") + argv[optind]); + } + + // + // Step 3. Post parsing validation, usually for options that depend on the + // others which are not determined until after parsing. 
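For the colon-separated options handled in the cases above, such as --cuda-memory-pool-byte-size, a self-contained sketch of what ParsePairOption does with a value like 0:67108864: split once on the delimiter and convert the two halves independently, rejecting input that has no delimiter. SplitDeviceBytePair is a hypothetical stand-in; the real helper is templated and throws ParseException with a similar message.

    // Hypothetical stand-in for a device-id/byte-size pair option parser;
    // not part of the patch.
    #include <cstdint>
    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include <utility>

    std::pair<int, uint64_t>
    SplitDeviceBytePair(const std::string& arg, const std::string& delim)
    {
      const size_t pos = arg.find(delim);
      if (pos == std::string::npos) {
        throw std::runtime_error(
            "Cannot parse pair option -- argument requires format <first>" +
            delim + "<second>. Found: " + arg);
      }
      const int device_id = std::stoi(arg.substr(0, pos));
      const uint64_t byte_size = std::stoull(arg.substr(pos + delim.size()));
      return {device_id, byte_size};
    }

    int
    main()
    {
      // e.g. --cuda-memory-pool-byte-size=0:67108864  (GPU 0, 64 MB pool)
      const auto pool = SplitDeviceBytePair("0:67108864", ":");
      std::cout << "GPU " << pool.first << " -> " << pool.second << " bytes"
                << std::endl;
      return 0;
    }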
+ // + + if (lparams.control_mode_ != TRITONSERVER_MODEL_CONTROL_POLL) { + lparams.repository_poll_secs_ = 0; + } + +#ifdef TRITON_ENABLE_VERTEX_AI + // Set default model repository if specific flag is set, postpone the + // check to after parsing so we only monitor the default repository if + // Vertex service is allowed + if (lparams.model_repository_paths_.empty()) { + auto aip_storage_uri = + triton::server::GetEnvironmentVariableOrDefault("AIP_STORAGE_URI", ""); + if (!aip_storage_uri.empty()) { + lparams.model_repository_paths_.insert(aip_storage_uri); + } + } +#endif // TRITON_ENABLE_VERTEX_AI + +#ifdef TRITON_ENABLE_METRICS + lparams.allow_gpu_metrics_ &= lparams.allow_metrics_; + lparams.allow_cpu_metrics_ &= lparams.allow_metrics_; + // Set metrics_address to default if never specified + if (lparams.metrics_address_.empty()) { +#ifdef TRITON_ENABLE_HTTP + // If built with HTTP support, default to HTTP address + lparams.metrics_address_ = lparams.http_address_; +#else + // Otherwise have default for builds without HTTP support + lparams.metrics_address_ = "0.0.0.0"; +#endif // TRITON_ENABLE_HTTP + } +#endif // TRITON_ENABLE_METRICS + +#ifdef TRITON_ENABLE_TRACING + PostProcessTraceArgs( + lparams, trace_level_present, trace_rate_present, trace_count_present, + trace_filepath_present, trace_log_frequency_present, + explicit_disable_trace); +#endif // TRITON_ENABLE_TRACING + + // Check if there is a conflict between --disable-auto-complete-config + // and --strict-model-config + if (disable_auto_complete_config) { + if (strict_model_config_present && !lparams.strict_model_config_) { + std::cerr + << "Warning: Overriding deprecated '--strict-model-config' from " + "False to True in favor of '--disable-auto-complete-config'!" + << std::endl; + } + lparams.strict_model_config_ = true; + } + + // Check if there is a conflict between --response-cache-byte-size + // and --cache-config + if (cache_size_present && cache_config_present) { + throw ParseException( + "Error: Incompatible flags --response-cache-byte-size and " + "--cache-config both provided. Please provide one or the other."); + } + lparams.enable_cache_ = (cache_size_present || cache_config_present); + return {lparams, {}}; +} + +std::string +TritonParser::FormatUsageMessage(std::string str, int offset) +{ + int width = 60; + int current_pos = offset; + while (current_pos + width < int(str.length())) { + int n = str.rfind(' ', current_pos + width); + if (n != int(std::string::npos)) { + str.replace(n, 1, "\n\t"); + current_pos += (width + 9); + } + } + + return str; +} + +std::string +TritonParser::Usage() +{ + std::stringstream ss; + for (const auto& group : option_groups_) { + if (!group.first.empty() && !group.second.empty()) { + ss << std::endl << group.first << ":" << std::endl; + } + + for (const auto& o : group.second) { + if (!o.arg_desc_.empty()) { + ss << " --" << o.flag_ << " <" << o.arg_desc_ << ">" << std::endl + << "\t" << FormatUsageMessage(o.desc_, 0) << std::endl; + } else { + ss << " --" << o.flag_ << std::endl + << "\t" << FormatUsageMessage(o.desc_, 0) << std::endl; + } + } + } + return ss.str(); +} + +std::tuple +TritonParser::ParseMetricsConfigOption(const std::string& arg) +{ + // Format is "=" for generic configs/settings + int delim_setting = arg.find("="); + if (delim_setting < 0) { + std::stringstream ss; + ss << "--metrics-config option format is " + << "=. 
Got " << arg << std::endl; + throw ParseException(ss.str()); + } + + // Break section before "=" into substr to avoid matching commas + // in setting values. + auto name_substr = arg.substr(0, delim_setting); + int delim_name = name_substr.find(","); + + // No name-specific configs currently supported, though it may be in + // the future. Map global configs to empty string like other configs for + // now. + std::string name_string = std::string(); + if (delim_name >= 0) { + std::stringstream ss; + ss << "--metrics-config option format is " + << "=. Got " << arg << std::endl; + throw ParseException(ss.str()); + } // else global metrics config + + std::string setting_string = + arg.substr(delim_name + 1, delim_setting - delim_name - 1); + std::string value_string = arg.substr(delim_setting + 1); + + if (setting_string.empty() || value_string.empty()) { + std::stringstream ss; + ss << "--metrics-config option format is " + << "=. Got " << arg << std::endl; + throw ParseException(ss.str()); + } + + return {name_string, setting_string, value_string}; +} + +std::tuple +TritonParser::ParseCacheConfigOption(const std::string& arg) +{ + // Format is ",=" for specific + // config/settings and "=" for cache agnostic + // configs/settings + int delim_name = arg.find(","); + int delim_setting = arg.find("=", delim_name + 1); + + std::string name_string = std::string(); + if (delim_name > 0) { + name_string = arg.substr(0, delim_name); + } + // No cache-agnostic global settings are currently supported + else { + std::stringstream ss; + ss << "No cache specified. --cache-config option format is " + << ",=. Got " << arg << std::endl; + throw ParseException(ss.str()); + } + + if (delim_setting < 0) { + std::stringstream ss; + ss << "--cache-config option format is ',='. Got " + << arg << std::endl; + throw ParseException(ss.str()); + } + std::string setting_string = + arg.substr(delim_name + 1, delim_setting - delim_name - 1); + std::string value_string = arg.substr(delim_setting + 1); + + if (setting_string.empty() || value_string.empty()) { + std::stringstream ss; + ss << "--cache-config option format is ',='. Got " + << arg << std::endl; + throw ParseException(ss.str()); + } + + return {name_string, setting_string, value_string}; +} + +std::tuple +TritonParser::ParseRateLimiterResourceOption(const std::string& arg) +{ + std::string error_string( + "--rate-limit-resource option format is " + "'::' or ':'. 
" + "Got " + + arg); + + std::string name_string(""); + int count = -1; + int device_id = -1; + + size_t delim_first = arg.find(":"); + size_t delim_second = arg.find(":", delim_first + 1); + + if (delim_second != std::string::npos) { + // Handle format `::' + size_t delim_third = arg.find(":", delim_second + 1); + if (delim_third != std::string::npos) { + throw ParseException(error_string); + } + name_string = arg.substr(0, delim_first); + count = ParseOption( + arg.substr(delim_first + 1, delim_second - delim_first - 1)); + device_id = ParseOption(arg.substr(delim_second + 1)); + } else if (delim_first != std::string::npos) { + // Handle format `:' + name_string = arg.substr(0, delim_first); + count = ParseOption(arg.substr(delim_first + 1)); + } else { + // If no colons found + throw ParseException(error_string); + } + + return {name_string, count, device_id}; +} + +std::tuple +TritonParser::ParseBackendConfigOption(const std::string& arg) +{ + // Format is ",=" for specific + // config/settings and "=" for backend agnostic + // configs/settings + int delim_name = arg.find(","); + int delim_setting = arg.find("=", delim_name + 1); + + std::string name_string = std::string(); + if (delim_name > 0) { + name_string = arg.substr(0, delim_name); + } else if (delim_name == 0) { + std::stringstream ss; + ss << "No backend specified. --backend-config option format is " + << ",= or " + << "=. Got " << arg << std::endl; + throw ParseException(ss.str()); + } // else global backend config + + if (delim_setting < 0) { + std::stringstream ss; + ss << "--backend-config option format is ',='. Got " + << arg << std::endl; + throw ParseException(ss.str()); + } + std::string setting_string = + arg.substr(delim_name + 1, delim_setting - delim_name - 1); + std::string value_string = arg.substr(delim_setting + 1); + + if (setting_string.empty() || value_string.empty()) { + std::stringstream ss; + ss << "--backend-config option format is ',='. 
Got " + << arg << std::endl; + throw ParseException(ss.str()); + } + + return {name_string, setting_string, value_string}; +} + +void +TritonParser::ParseRestrictedFeatureOption( + const std::string& arg, const std::string& option_name, + const std::string& key_prefix, const std::string& feature_type, + RestrictedFeatures& restricted_features) +{ + const auto& parsed_tuple = + ParseGenericConfigOption(arg, ":", "=", option_name, "config name"); + + const auto& features = SplitOptions(std::get<0>(parsed_tuple), ","); + const auto& key = std::get<1>(parsed_tuple); + const auto& value = std::get<2>(parsed_tuple); + + for (const auto& feature : features) { + const auto& category = RestrictedFeatures::ToCategory(feature); + + if (category == RestrictedCategory::INVALID) { + std::stringstream ss; + ss << "unknown restricted " << feature_type << " '" << feature << "' " + << std::endl; + throw ParseException(ss.str()); + } + + if (restricted_features.IsRestricted(category)) { + // restricted feature can only be in one group + std::stringstream ss; + ss << "restricted " << feature_type << " '" << feature + << "' can not be specified in multiple config groups" << std::endl; + throw ParseException(ss.str()); + } + restricted_features.Insert( + category, std::make_pair(key_prefix + key, value)); + } +} + +std::tuple +TritonParser::ParseHostPolicyOption(const std::string& arg) +{ + return ParseGenericConfigOption(arg, ",", "=", "host-policy", "policy name"); +} + +std::tuple +TritonParser::ParseGenericConfigOption( + const std::string& arg, const std::string& first_delim, + const std::string& second_delim, const std::string& option_name, + const std::string& config_name) +{ + // Format is ",=" + int delim_name = arg.find(first_delim); + int delim_setting = arg.find(second_delim, delim_name + 1); + + std::string error_string = "--" + option_name + " option format is '<" + + config_name + ">" + first_delim + "" + + second_delim + "'. Got " + arg + "\n"; + + // Check for 2 semicolons + if ((delim_name < 0) || (delim_setting < 0)) { + throw ParseException(error_string); + } + + std::string name_string = arg.substr(0, delim_name); + std::string setting_string = + arg.substr(delim_name + 1, delim_setting - delim_name - 1); + std::string value_string = arg.substr(delim_setting + 1); + + if (name_string.empty() || setting_string.empty() || value_string.empty()) { + throw ParseException(error_string); + } + + return {name_string, setting_string, value_string}; +} + +#ifdef TRITON_ENABLE_TRACING +TRITONSERVER_InferenceTraceLevel +TritonParser::ParseTraceLevelOption(std::string arg) +{ + std::transform(arg.begin(), arg.end(), arg.begin(), [](unsigned char c) { + return std::tolower(c); + }); + + if ((arg == "false") || (arg == "off")) { + return TRITONSERVER_TRACE_LEVEL_DISABLED; + } + if ((arg == "true") || (arg == "on") || (arg == "min") || (arg == "max") || + (arg == "timestamps")) { + return TRITONSERVER_TRACE_LEVEL_TIMESTAMPS; + } + if (arg == "tensors") { + return TRITONSERVER_TRACE_LEVEL_TENSORS; + } + + throw ParseException("invalid value for trace level option: " + arg); +} + +InferenceTraceMode +TritonParser::ParseTraceModeOption(std::string arg) +{ + std::transform(arg.begin(), arg.end(), arg.begin(), [](unsigned char c) { + return std::tolower(c); + }); + + if (arg == "triton") { + return TRACE_MODE_TRITON; + } + if (arg == "opentelemetry") { + return TRACE_MODE_OPENTELEMETRY; + } + + throw ParseException( + "invalid value for trace mode option: " + arg + + ". 
Available options are \"triton\" and \"opentelemetry\""); +} + +std::tuple +TritonParser::ParseTraceConfigOption(const std::string& arg) +{ + int delim_name = arg.find(","); + int delim_setting = arg.find("=", delim_name + 1); + + std::string name_string = std::string(); + if (delim_name > 0) { + name_string = + std::to_string(ParseTraceModeOption(arg.substr(0, delim_name))); + } else if (delim_name == 0) { + std::stringstream ss; + ss << "No trace mode specified. --trace-config option format is " + << ",= or " + << "=. Got " << arg << std::endl; + throw ParseException(ss.str()); + } // else global trace config + + if (delim_setting < 0) { + std::stringstream ss; + ss << "--trace-config option format is ',='. " + "Got " + << arg << std::endl; + throw ParseException(ss.str()); + } + std::string setting_string = + arg.substr(delim_name + 1, delim_setting - delim_name - 1); + std::string value_string = arg.substr(delim_setting + 1); + + if (setting_string.empty() || value_string.empty()) { + std::stringstream ss; + ss << "--trace-config option format is ',='. " + "Got " + << arg << std::endl; + throw ParseException(ss.str()); + } + + return {name_string, setting_string, value_string}; +} + +void +TritonParser::SetGlobalTraceArgs( + TritonServerParameters& lparams, bool trace_level_present, + bool trace_rate_present, bool trace_count_present, + bool explicit_disable_trace) +{ + for (const auto& global_setting : lparams.trace_config_map_[""]) { + try { + if (global_setting.first == "rate") { + if (trace_rate_present) { + std::cerr << "Warning: Overriding deprecated '--trace-rate' " + "in favor of provided rate value in --trace-config!" + << std::endl; + } + lparams.trace_rate_ = ParseOption(global_setting.second); + } + if (global_setting.first == "level") { + if (trace_level_present) { + std::cerr << "Warning: Overriding deprecated '--trace-level' " + "in favor of provided level in --trace-config!" + << std::endl; + } + auto parsed_level_config = ParseTraceLevelOption(global_setting.second); + explicit_disable_trace |= + (parsed_level_config == TRITONSERVER_TRACE_LEVEL_DISABLED); + lparams.trace_level_ = static_cast( + lparams.trace_level_ | parsed_level_config); + } + if (global_setting.first == "mode") { + lparams.trace_mode_ = ParseTraceModeOption(global_setting.second); + } + if (global_setting.first == "count") { + if (trace_count_present) { + std::cerr << "Warning: Overriding deprecated '--trace-count' " + "in favor of provided count in --trace-config!" + << std::endl; + } + lparams.trace_count_ = ParseOption(global_setting.second); + } + } + catch (const ParseException& pe) { + std::stringstream ss; + ss << "Bad option: \"--trace-config " << global_setting.first << "\".\n" + << pe.what() << std::endl; + throw ParseException(ss.str()); + } + } +} + +void +TritonParser::SetTritonTraceArgs( + TritonServerParameters& lparams, bool trace_filepath_present, + bool trace_log_frequency_present) +{ + for (const auto& mode_setting : + lparams.trace_config_map_[std::to_string(TRACE_MODE_TRITON)]) { + try { + if (mode_setting.first == "file") { + if (trace_filepath_present) { + std::cerr << "Warning: Overriding deprecated '--trace-file' " + "in favor of provided file in --trace-config!" + << std::endl; + } + lparams.trace_filepath_ = mode_setting.second; + } else if (mode_setting.first == "log-frequency") { + if (trace_log_frequency_present) { + std::cerr << "Warning: Overriding deprecated '--trace-file' " + "in favor of provided file in --trace-config!" 
+ << std::endl; + } + lparams.trace_log_frequency_ = ParseOption(mode_setting.second); + } + } + catch (const ParseException& pe) { + std::stringstream ss; + ss << "Bad option: \"--trace-config triton," << mode_setting.first + << "\".\n" + << pe.what() << std::endl; + throw ParseException(ss.str()); + } + } +} + +void +TritonParser::VerifyOpentelemetryTraceArgs( + bool trace_filepath_present, bool trace_log_frequency_present) +{ + if (trace_filepath_present) { + std::cerr << "Warning: '--trace-file' is deprecated and will " + "be ignored with opentelemetry tracing mode. " + << std::endl; + } + if (trace_log_frequency_present) { + std::cerr << "Warning: '--trace-log-frequency' is deprecated " + "and will be ignored with opentelemetry tracing mode." + << std::endl; + } +} + +void +TritonParser::PostProcessTraceArgs( + TritonServerParameters& lparams, bool trace_level_present, + bool trace_rate_present, bool trace_count_present, + bool trace_filepath_present, bool trace_log_frequency_present, + bool explicit_disable_trace) +{ + SetGlobalTraceArgs( + lparams, trace_level_present, trace_rate_present, trace_count_present, + explicit_disable_trace); + + if (lparams.trace_mode_ == TRACE_MODE_OPENTELEMETRY) { + VerifyOpentelemetryTraceArgs( + trace_filepath_present, trace_log_frequency_present); + } else if (lparams.trace_mode_ == TRACE_MODE_TRITON) { + SetTritonTraceArgs( + lparams, trace_filepath_present, trace_log_frequency_present); + } + + if (explicit_disable_trace) { + lparams.trace_level_ = TRITONSERVER_TRACE_LEVEL_DISABLED; + } +} + +#endif // TRITON_ENABLE_TRACING +}} // namespace triton::server diff --git a/src/command_line_parser.h b/src/command_line_parser.h new file mode 100644 index 0000000000..ef562a3efb --- /dev/null +++ b/src/command_line_parser.h @@ -0,0 +1,345 @@ +// Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NVIDIA CORPORATION nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
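For reference, the deprecation warnings and PostProcessTraceArgs() handling above amount to the following migration from the legacy tracing flags to --trace-config; the angle-bracket placeholders are illustrative, reconstructed from the parsing logic rather than quoted from documentation:

    --trace-file <path>         ->  --trace-config triton,file=<path>
    --trace-level <level>       ->  --trace-config level=<level>
    --trace-rate <rate>         ->  --trace-config rate=<rate>
    --trace-count <count>       ->  --trace-config count=<count>
    --trace-log-frequency <n>   ->  --trace-config triton,log-frequency=<n>

When both forms are given, SetGlobalTraceArgs() and SetTritonTraceArgs() warn and apply the --trace-config value (trace levels are OR-ed together), and a level of "off"/"false" from either form sets explicit_disable_trace so that tracing ends up fully disabled.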
+// +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "restricted_features.h" +#include "triton/common/logging.h" +#include "triton/core/tritonserver.h" +#ifdef TRITON_ENABLE_GRPC +// To avoid ambiguous reference during build +// grpc headers should be imported first +// https://github.com/open-telemetry/opentelemetry-cpp/blob/main/examples/otlp/README.md#additional-notes-regarding-abseil-library +#include "grpc/grpc_server.h" +#endif // TRITON_ENABLE_GRPC +#if defined(TRITON_ENABLE_HTTP) || defined(TRITON_ENABLE_METRICS) +#include "http_server.h" +#endif // TRITON_ENABLE_HTTP || TRITON_ENABLE_METRICS +#ifdef TRITON_ENABLE_SAGEMAKER +#include "sagemaker_server.h" +#endif // TRITON_ENABLE_SAGEMAKER +#ifdef TRITON_ENABLE_VERTEX_AI +#include "vertex_ai_server.h" +#endif // TRITON_ENABLE_VERTEX_AI + +#ifndef _WIN32 +#include +#include +#else +// Minimum implementation of for Windows +#define required_argument 1 +#define no_argument 2 +struct option { + option(const char* name, int has_arg, int* flag, int val) + : name(name), has_arg(has_arg), flag(flag), val(val) + { + } + const char* name; + int has_arg; + int* flag; + int val; +}; +#endif +#ifdef TRITON_ENABLE_TRACING +#include "tracer.h" +#endif + + +namespace triton { namespace server { + +// Command-line options +struct Option { + static constexpr const char* ArgNone = ""; + static constexpr const char* ArgBool = "boolean"; + static constexpr const char* ArgFloat = "float"; + static constexpr const char* ArgInt = "integer"; + static constexpr const char* ArgStr = "string"; + + Option(int id, std::string flag, std::string arg_desc, std::string desc) + : id_(id), flag_(flag), arg_desc_(arg_desc), desc_(desc) + { + } + + struct option GetLongOption() const + { + struct option lo { + flag_.c_str(), (!arg_desc_.empty()) ? required_argument : no_argument, + nullptr, id_ + }; + return lo; + } + + const int id_; + const std::string flag_; + const std::string arg_desc_; + const std::string desc_; +}; + +struct TritonServerParameters { + std::string server_id_{"triton"}; + bool exit_on_error_{true}; + bool strict_model_config_{false}; + bool strict_readiness_{true}; + int32_t exit_timeout_secs_{30}; +#ifdef TRITON_ENABLE_GPU + double min_supported_compute_capability_{TRITON_MIN_COMPUTE_CAPABILITY}; +#else + double min_supported_compute_capability_{0.0}; +#endif // TRITON_ENABLE_GPU + std::string repoagent_dir_{"/opt/tritonserver/repoagents"}; + std::string backend_dir_{"/opt/tritonserver/backends"}; + std::vector> + backend_config_settings_; + + // Model repository manager configuration + bool enable_model_namespacing_{false}; + std::set model_repository_paths_{}; + TRITONSERVER_ModelControlMode control_mode_{TRITONSERVER_MODEL_CONTROL_NONE}; + std::set startup_models_{}; + // Interval, in seconds, when the model repository is polled for changes. + int32_t repository_poll_secs_{15}; + // Number of threads to use for concurrently loading models + uint32_t model_load_thread_count_{4}; + std::map load_gpu_limit_; + + // Rate limiter configuration + // FIXME: Once the rate limiter implementation is complete make + // EXEC_COUNT the default. 
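The rate limiter fields below are filled from the options parsed earlier. A hedged illustration of the accepted argument shapes, inferred from the OPTION_RATE_LIMIT case and ParseRateLimiterResourceOption() above rather than from separate documentation (the resource name "R1" is a made-up example); the commented-out initializer that follows records the intended future default:

    // --rate-limit execution_count   -> rate_limit_mode_ = TRITONSERVER_RATE_LIMIT_EXEC_COUNT
    // --rate-limit off               -> rate_limit_mode_ = TRITONSERVER_RATE_LIMIT_OFF (current default)
    // --rate-limit-resource R1:4     -> {"R1", 4, -1}  (device id recorded as -1 when omitted)
    // --rate-limit-resource R1:4:0   -> {"R1", 4, 0}   (resource R1, count 4, device 0)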
+ // TRITONSERVER_RateLimitMode + // rate_limit_mode_{TRITONSERVER_RATE_LIMIT_EXEC_COUNT}; + TRITONSERVER_RateLimitMode rate_limit_mode_{TRITONSERVER_RATE_LIMIT_OFF}; + std::vector> rate_limit_resources_; + + // memory pool configuration + int64_t pinned_memory_pool_byte_size_{1 << 28}; + std::list> cuda_pools_; + std::list> cuda_virtual_address_size_; + + // [FIXME] this option is broken after backend separation: this should have + // controlled backend copy behavior but not properly propagate to backend + // after separation, need to go through backend config. + int32_t buffer_manager_thread_count_{0}; + + std::vector> host_policies_; + + // Cache configuration + bool enable_cache_{false}; + std::string cache_dir_{"/opt/tritonserver/caches"}; + std::unordered_map< + std::string, std::vector>> + cache_config_settings_; + +#ifdef TRITON_ENABLE_LOGGING + bool log_info_{true}; + bool log_warn_{true}; + bool log_error_{true}; + int32_t log_verbose_{0}; + triton::common::Logger::Format log_format_{ + triton::common::Logger::Format::kDEFAULT}; + std::string log_file_{}; +#endif // TRITON_ENABLE_LOGGING + +#ifdef TRITON_ENABLE_TRACING + std::string trace_filepath_{}; + TRITONSERVER_InferenceTraceLevel trace_level_{ + TRITONSERVER_TRACE_LEVEL_DISABLED}; + int32_t trace_rate_{1000}; + int32_t trace_count_{-1}; + int32_t trace_log_frequency_{0}; + InferenceTraceMode trace_mode_{TRACE_MODE_TRITON}; + TraceConfigMap trace_config_map_; +#endif // TRITON_ENABLE_TRACING + +// The configurations for various endpoints (i.e. HTTP, GRPC and metrics) +#ifdef TRITON_ENABLE_HTTP + bool allow_http_{true}; + std::string http_address_{"0.0.0.0"}; + int32_t http_port_{8000}; + bool reuse_http_port_{false}; + std::string http_forward_header_pattern_; + // The number of threads to initialize for the HTTP front-end. + int http_thread_cnt_{8}; + RestrictedFeatures http_restricted_apis_{}; +#endif // TRITON_ENABLE_HTTP + +#ifdef TRITON_ENABLE_GRPC + bool allow_grpc_{true}; + triton::server::grpc::Options grpc_options_; +#endif // TRITON_ENABLE_GRPC + +#ifdef TRITON_ENABLE_METRICS + bool allow_metrics_{true}; + // Defaults to http_address_ if TRITON_ENABLE_HTTP is enabled for backwards, + // otherwise defaults to "0.0.0.0" for TRITON_ENABLE_HTTP is disabled. + std::string metrics_address_{""}; + int32_t metrics_port_{8002}; + // Metric settings for Triton core + float metrics_interval_ms_{2000}; + bool allow_gpu_metrics_{true}; + bool allow_cpu_metrics_{true}; + std::vector> + metrics_config_settings_; +#endif // TRITON_ENABLE_METRICS + +#ifdef TRITON_ENABLE_SAGEMAKER + bool allow_sagemaker_{false}; + std::string sagemaker_address_{"0.0.0.0"}; + int32_t sagemaker_port_{8080}; + bool sagemaker_safe_range_set_{false}; + std::pair sagemaker_safe_range_{-1, -1}; + // The number of threads to initialize for the SageMaker HTTP front-end. + int sagemaker_thread_cnt_{8}; +#endif // TRITON_ENABLE_SAGEMAKER + +#ifdef TRITON_ENABLE_VERTEX_AI + bool allow_vertex_ai_{false}; + std::string vertex_ai_address_{"0.0.0.0"}; + int32_t vertex_ai_port_{8080}; + // The number of threads to initialize for the Vertex AI HTTP front-end. + int vertex_ai_thread_cnt_{8}; + std::string vertex_ai_default_model_{}; +#endif // TRITON_ENABLE_VERTEX_AI + + // [FIXME] who should call this function? 
+ void CheckPortCollision(); + using ManagedTritonServerOptionPtr = std::unique_ptr< + TRITONSERVER_ServerOptions, decltype(&TRITONSERVER_ServerOptionsDelete)>; + ManagedTritonServerOptionPtr BuildTritonServerOptions(); +}; + +// Exception type to be thrown if the error is parsing related +class ParseException : public std::exception { + public: + ParseException() = default; + ParseException(const std::string& message) : message_(message) {} + + virtual const char* what() const throw() { return message_.c_str(); } + + private: + const std::string message_{""}; +}; + +// [WIP] Fall-through parser, Parse() will convert the recognized options into +// parameter object and return the unrecognized options to be another argument +// list for other parser to consume. +// This allows the composition of parser chain. +// [FIXME] abstract interface, concrete class below should only parse Triton +// core and endpoint control options (endpoint specific options in their own +// parser) +class TritonParser { + public: + TritonParser(); + // Parse command line arguments into a parameters struct and transform + // the argument list to contain only unrecognized options. The content of + // unrecognized argument list shares the same lifecycle as 'argv'. + // Raise ParseException if fail to parse recognized options. + std::pair> Parse( + int argc, char** argv); + + // Return usage of all recognized options + std::string Usage(); + + private: + std::string FormatUsageMessage(std::string str, int offset); + // Helper functions for parsing options that require multi-value parsing. + std::tuple ParseCacheConfigOption( + const std::string& arg); + std::tuple ParseRateLimiterResourceOption( + const std::string& arg); + std::tuple ParseBackendConfigOption( + const std::string& arg); + std::tuple ParseHostPolicyOption( + const std::string& arg); + std::tuple ParseMetricsConfigOption( + const std::string& arg); + void ParseRestrictedFeatureOption( + const std::string& arg, const std::string& option_name, + const std::string& header_prefix, const std::string& feature_type, + RestrictedFeatures& restricted_features); +#ifdef TRITON_ENABLE_TRACING + TRITONSERVER_InferenceTraceLevel ParseTraceLevelOption(std::string arg); + InferenceTraceMode ParseTraceModeOption(std::string arg); + std::tuple ParseTraceConfigOption( + const std::string& arg); + // Helper functions for post processing for collected trace arguments. 
+ void SetGlobalTraceArgs( + TritonServerParameters& lparams, bool trace_level_present, + bool trace_rate_present, bool trace_count_present, + bool explicit_disable_trace); + void SetTritonTraceArgs( + TritonServerParameters& lparams, bool trace_filepath_present, + bool trace_log_frequency_present); + void VerifyOpentelemetryTraceArgs( + bool trace_filepath_present, bool trace_log_frequency_present); + void PostProcessTraceArgs( + TritonServerParameters& lparams, bool trace_level_present, + bool trace_rate_present, bool trace_count_present, + bool trace_filepath_present, bool trace_log_frequency_present, + bool explicit_disable_trace); +#endif // TRITON_ENABLE_TRACING + // Helper function to parse option in + // "[1st_delim][2nd_delim]" format + std::tuple ParseGenericConfigOption( + const std::string& arg, const std::string& first_delim, + const std::string& second_delim, const std::string& option_name, + const std::string& config_name); + + // Initialize individual option groups + void SetupOptions(); + // Initialize option group mappings + void SetupOptionGroups(); + + // Sum of option groups: vector to maintain insertion order for Usage() + std::vector&>> option_groups_; + // Individual option groups + std::vector